Building an AI Research Assistant: Large Language Models as a Versatile Tool for Digital Historians¶
©<AUTHOR or ORGANIZATION / FUNDER>. Published by De Gruyter in cooperation with the University of Luxembourg Centre for Contemporary and Digital History. This is an Open Access article distributed under the terms of the Creative Commons Attribution License CC-BY
```python
from IPython.display import Image, display

display(Image("./media/placeholder.png"))
```

Keywords: Large language models, GPT-4, artificial intelligence, machine learning, historical methodology, optical character recognition, oral history, prompt engineering
This article explores the potential applications and implications of large language models (LLMs) for historical research and pedagogy. It examines the capabilities of GPT-4 and other machine learning models through case studies assessing their utility in fostering greater accessibility to historical sources. GPT-4's performance is evaluated on a series of prompted tasks, including data preparation, source analysis, and the ethical implications of simulated historical worldviews. GPT-4's proficiency in historical knowledge is also evaluated using a widely recognized machine learning benchmark. A replication study demonstrates that GPT-4 exhibits expert-level performance in three distinct historical subfields. Given the rapid advances in LLMs, historians should contribute to wider debates surrounding these technologies, as the unpredictable impacts of democratized AI on historical knowledge are already emerging.
In the article's hermeneutical layer, the author explores the practice of prompt engineering: techniques for using natural language instructions to guide an LLM's output. Prompt engineering strategies are demonstrated through the use of few-shot prompting, chain-of-thought reasoning, and prompt chaining.
Introduction¶
In 2003, Roy Rosenzweig predicted that digital historians would need to develop new techniques "to research, write, and teach in a world of unheard-of historical abundance." () Over the past two decades, historians have risen to this challenge, embracing digital mapping, network analysis, and distant reading of large text collections as part of their methodological toolkit. () Machine learning (ML) has also emerged as a promising field in computational history, offering rapid analysis of vast datasets and making significant contributions to historical research. The range of recent applications employing ML techniques is impressive: restoring lost fragments of ancient papyri (), automated transcription of medieval manuscripts (), identifying soldiers from the U.S. Civil War through facial recognition (), and analyzing the archival records of South Africa's Truth and Reconciliation Commission to spatially and semantically map the Apartheid era ().
Yet as ML techniques increasingly become a part of the digital historian's toolkit, new research frontiers are emerging in the development of "foundational models" of artificial intelligence (AI), which possess striking capacities across a range of modalities. The scope of these capabilities and their broader implications remain intensely debated (). Despite being in its infancy, this technology offers a variety of methodological possibilities for historians.
However, these advances also give historians cause for concern. The proliferation of "fake history" is a clear risk. Historians typically employ ML methods for analytical purposes, using advanced pattern recognition to interpret large collections of historical data. Yet the same techniques that provide algorithms with insights into these sources can also enable the creation of synthetic data resembling genuine artifacts. While the historical record contains numerous forgeries, advances in "generative AI" may facilitate the production of convincing disinformation on an unprecedented scale and with uncanny verisimilitude. "Deepfakes" and other forms of digital fabrication could exacerbate what many already consider a broader epistemological crisis (). Given the rapid pace of these advances, it is crucial that the profession addresses the implications of this technology. Historians will have much to contribute in contextualizing the innovative and disruptive potential of these breakthroughs.
Our engagement with this technology, however, cannot be limited to critiques from the sidelines. Historians have a compelling interest in directly engaging with both the perils and potential of "generative AI." Researchers are already exploring how these foundational models interpret historical knowledge, yielding remarkable results. Nevertheless, a critical discourse has emerged, raising key questions about how these models achieve their capabilities, the limits of their performance, and their propensity to reinforce existing inequalities. There are rich opportunities for historians to contribute to these debates. The growing maturity of these applications enables increasingly democratized access to advanced ML models (). This article proposes a starting point for historical exploration of AI by examining the capacities of one of the most studied (and controversial) AI systems: the GPT series of large language models (LLMs).
The Possibilities and Perils of Generative AI¶
As historians explore the possibilities of what is sometimes called generative AI, it is important to understand how these models are created and how they function. With this knowledge, we can better assess their strengths, weaknesses, and potential impact on historical research.
Large language models are based on gargantuan datasets created through large-scale "scraping" of the internet. Machine learning scientists employ deep learning techniques to develop statistical models that are trained on these extensive datasets. With sufficient time and computational power, the LLMs begin to exhibit a range of "emergent" capabilities (). The nature of these capacities remains a matter of intense research and debate, as do the ethical and legal questions surrounding their use. However, it is clear that these models can both interpret and generate data in ways that surpass previous ML methods. Scholars studying these AI systems have labeled them "foundational models" due to their potential to enable new domains of computational analysis ().
A variety of these foundational models have become publicly accessible. Among the best-known and most studied models is the Generative Pre-trained Transformer series by OpenAI, commonly referred to by the acronyms GPT-3, GPT-4, and ChatGPT. The GPT models are trained to statistically predict the next sequence of words (tokens) for a given text. Common examples of this capacity are the autocomplete functions in text messaging and Gmail. However, the capabilities of GPT models extend far beyond these simple applications. With their predictive abilities, GPT-3 and its successors can summarize texts, perform language translation, write working computer code, and compose strikingly informative responses on a wide array of subjects (). Indeed, the remarkable versatility of LLMs is stimulating broader discussions about the potential implications of these technologies for society at large ().
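The underlying idea of next-token prediction can be illustrated with a deliberately primitive stand-in: instead of a neural network trained on billions of words, a bigram table counts which word follows which in a toy corpus and "autocompletes" by always picking the most frequent successor. Everything here (the corpus, the function names) is invented for illustration; real LLMs learn vastly richer statistical patterns, but the prediction objective is the same in spirit.

```python
from collections import Counter, defaultdict

# A toy stand-in for an LLM's training data (illustrative only).
corpus = "the archive holds letters . the archive holds maps . the museum holds paintings ."

# Count, for each word, which words follow it (a bigram table).
bigrams = defaultdict(Counter)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    bigrams[current_word][next_word] += 1

def complete(prompt, n_words=3):
    """Greedily extend the prompt with the statistically most frequent next word."""
    tokens = prompt.split()
    for _ in range(n_words):
        followers = bigrams.get(tokens[-1])
        if not followers:
            break
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)

print(complete("the"))  # → "the archive holds letters"
```

Scaled up by many orders of magnitude, and with deep neural networks replacing the frequency table, this same "predict what comes next" objective yields the fluent completions the GPT series is known for.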
While the GPT series from OpenAI is the best known of these foundational models, a growing number of both commercial and open-source alternatives are available for researchers and the general public. Notable commercial LLMs include Google's Bard, Anthropic's Claude, and the models offered by Cohere. Open-source LLMs encompass BLOOM, a multilingual LLM created by the Big Science Research Workshop (), EleutherAI's GPT series, Google's FLAN instruct series (), and Meta's LLaMA models (). Additionally, several notable LLMs have been studied by researchers but are not yet accessible to the broader public, such as Google's PaLM () and DeepMind's Chinchilla ().
Foundational models are also emerging in other domains, such as image synthesis. Models like CLIP () power OpenAI's DALL-E, Midjourney, and the open-source community behind Stable Diffusion. Remarkable advances are occurring in foundational models for audio and video synthesis as well. Perhaps most notably, a combination of these models has enabled the creation of multi-modal AIs capable of working across multiple domains, such as GPT-4 ().
For a survey of the development and capacities of these models as of April 2023, see: ()
While such claims elicit both excitement and alarm, any assessment of LLMs must first be tempered with humility. LLMs are often described as possessing "knowledge" and "understanding," yet direct engagement with these models can quickly reveal both their remarkable breadth and their narrow limits. Incisive critics of this technology characterize LLMs as "stochastic parrots" that excel at uncanny mimicry of human intelligence (). A form of this mimicry has proven convincing in the past. The first attribution of artificial intelligence to a computer program occurred in 1966 with a scripted chatbot named ELIZA, developed by AI pioneer Joseph Weizenbaum (, 289-298). A recent replication of this phenomenon occurred in June 2022 when a Google AI engineer declared the LLM he was training had become sentient (). Such attributions will likely increase as newer LLMs demonstrate increasing proficiency in seemingly distinct human qualities, like humor (, 39).
The means by which LLMs process, interpret, and generate information is a highly technical field requiring specialization in natural language processing, statistics, computational linguistics, and machine learning. As a historian, I lack the technical knowledge to evaluate the merits of these debates. However, I can assess the GPT series in terms of its historical accuracy. From my observations, LLMs offer significant opportunities for digital historians, but also cause for concern.
Judge for yourself in the following example. In the code below, GPT-4 is prompted to generate responses to the introduction of the "Philosophy of History" entry in the Stanford Encyclopedia of Philosophy by Daniel Little (). GPT-4's responses do not rely on any training or previous input from the user, nor does using GPT-4 require any specialized programming knowledge or hardware – only an OpenAI account and an API key or direct access via the ChatGPT browser.
```python
# Install the OpenAI Python library
!pip install openai
```

```python
# Enter OpenAI API key in the space below, after 'sk-'.
# Access to OpenAI's API keys can be found here: https://beta.openai.com/signup
import os

os.environ["OPENAI_API_KEY"] = "sk-"
```

```python
# The following is the introductory paragraph to: Little, Daniel, "Philosophy of History",
# The Stanford Encyclopedia of Philosophy (Spring 2022 Edition), Edward N. Zalta (ed.),
# URL = <https://plato.stanford.edu/archives/spr2022/entries/history/>
from IPython.display import Markdown

main_text = "Main Text:\n\nThe concept of history plays a fundamental role in human thought. It invokes notions of human agency, change, the role of material circumstances in human affairs, and the putative meaning of historical events. It raises the possibility of “learning from history.” And it suggests the possibility of better understanding ourselves in the present, by understanding the forces, choices, and circumstances that brought us to our current situation. It is therefore unsurprising that philosophers have sometimes turned their attention to efforts to examine history itself and the nature of historical knowledge. These reflections can be grouped together into a body of work called “philosophy of history.” This work is heterogeneous, comprising analyses and arguments of idealists, positivists, logicians, theologians, and others, and moving back and forth over the divides between European and Anglo-American philosophy, and between hermeneutics and positivism.\n\nGiven the plurality of voices within the “philosophy of history,” it is impossible to give one definition of the field that suits all these approaches. In fact, it is misleading to imagine that we refer to a single philosophical tradition when we invoke the phrase, “philosophy of history,” because the strands of research characterized here rarely engage in dialogue with each other. Still, we can usefully think of philosophers’ writings about history as clustering around several large questions, involving metaphysics, hermeneutics, epistemology, and ethics: (1) What does history consist of—individual actions, social structures, periods and regions, civilizations, large causal processes, divine intervention? (2) Does history as a whole have meaning, structure, or direction, beyond the individual events and actions that make it up? (3) What is involved in our knowing, representing, and explaining history? (4) To what extent do facts about human history create moral responsibilities for the present generation?"

display(Markdown("""Little, Daniel. "Philosophy of History", The Stanford Encyclopedia of Philosophy (Spring 2022 Edition), Edward N. Zalta (ed.) <https://plato.stanford.edu/archives/spr2022/entries/history/>\n\n""" + main_text))
```

```python
questions = "Prompt 1: Summarize the Main Text.\n\nPrompt 2: Translate the first sentence of the Main Text into German.\n\nPrompt 3. Compose a beautiful and evocative haiku with a 5-7-5 syllabic structure on the Main Text.\n\nPrompt 4: Based on the heterogenous schools mentioned in the Main Text, identify which approach best describes Hegel's philosophy of history, and why.\n\nPrompt 5: Based on the main text, provide a brief annotated bibliography of relevant and real sources for further reading in the Chicago Manual of Style format. Then identify the version of the Chicago style guide used for the citations."

display(Markdown("Questions posed to GPT-4 about the above passage\n\n" + questions))
```

```python
# OpenAI completion using the GPT-4 model.
# The SEP passage is supplied as prior context; the prompts follow as the user message.
import openai

query = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "assistant", "content": main_text},
        {"role": "user", "content": questions}
    ]
)
output = query['choices'][0]['message']['content']
display(Markdown("GPT-4's Response:\n\n" + output))
```

While acknowledging the caveats noted earlier, GPT-4's responses to these prompts reveal an impressive range of its abilities to both "interpret" language and generate plausible responses to instructions. Prompt 1's text summary is concise, accurate, and accessible. Prompt 2 provides an accurate translation of the first sentence of Little's introduction. Prompt 3 presents a haiku close to a 5-7-5 structure that aptly captures the spirit of the main text. Prompt 4 correctly associates idealism with Hegel, the figure linked to the philosophical approach of Absolute Idealism ().
Prompt 5 is revealing in other respects, particularly concerning the shortcomings of these models. The citations seemingly offer appropriate works on the philosophy of history, and GPT-4 even excels at supplying plausible bibliographic details. However, most LLMs are quite prone to generating citations whose author, publisher, and publication date cannot be confirmed in reference databases like WorldCat – in other words, citations invented by the model.
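One practical safeguard is to treat every model-generated citation as unverified until it has been checked against an external bibliographic source. As a sketch (assuming the public Crossref REST API, whose `/works` endpoint accepts a `query.bibliographic` parameter; the helper name is invented for this demonstration), the function below builds a lookup URL whose results a researcher or script can then compare against the model's claim:

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"  # public Crossref REST API (assumption: endpoint current)

def citation_lookup_url(citation: str, rows: int = 3) -> str:
    """Build a Crossref query URL for a model-generated citation string.

    Fetching this URL (e.g. with requests.get) returns candidate records;
    if no plausible match comes back, treat the citation as a likely
    hallucination and verify it manually in WorldCat or a library catalog.
    """
    params = {"query.bibliographic": citation, "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"

url = citation_lookup_url('White, Hayden. Metahistory. Johns Hopkins University Press, 1973.')
print(url)
```

This does not automate judgment, only retrieval: a human still decides whether the returned records actually match the citation GPT-4 produced.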
These inaccuracies are a phenomenon described by AI researchers as "hallucinations." Such hallucinations represent a major challenge in LLM research and for practical applications of this technology (). Indeed, the confident assertion of such factual inaccuracies poses a real danger in many domains, particularly given the remarkable effectiveness of these models in generating convincing and otherwise accurate prose. Detecting such errors can be difficult, as initial testing by OpenAI on the GPT series demonstrated that human readers often struggle to identify text generated by LLMs (, 16, table 7.3). Rectifying such hallucinations is a significant area of research. However, some scholars, like computational linguist Emily Bender, argue that such behaviors are inherent flaws of LLMs (). Given this tendency, historians must exercise great caution when employing this technology.
Per Bender: “GPT-3 can’t ‘compose nonfiction’. Nonfiction by definition is factual writing about the world. But GPT-3 has no access to facts, only to strings in its training data. To the extent that it outputs strings of words that humans interpret and can verify as factual, that factuality is and can only ever be purely accidental.”
The capabilities of LLMs, along with their flaws, stem from the vast dataset used to train them: the Internet itself. The data collection built for training GPT-3 encompassed the majority of English-language Wikipedia, Reddit's thousands of discussion forums, extensive corpora of digitized books, and a filtered (yet immense) collection of billions of web pages contained in the Common Crawl repository (, 8-9). While OpenAI's training process aimed to remove potentially offensive texts, the sheer scale of the dataset made selective curation impossible. Consequently, LLMs generate responses reflecting both the best and the worst of our online world.
This reality has troubled previous AI implementations. Well-intentioned researchers have created chatbots that spew hateful invective, human resources applications that refuse to hire female applicants, and algorithms based on criminal justice sentencing guidelines that starkly reinforce racial disparities already prevalent in the carceral system (). The GPT series has been known to unexpectedly generate responses in innocuous contexts containing violent imagery, sexually explicit language, and racial, ethnic, and religious slurs (). These findings further confirm the prescient warnings offered by scholars such as Safiya Umoja Noble (), Timnit Gebru (), Ruha Benjamin (), Kate Crawford (), and Trevor Paglen () on digital practices that reinforce analog inequalities. Some AI researchers view such behaviors as lamentable but solvable problems, amenable to further technical advances. Reducing the impact of biases is a significant research area, particularly through the creation of smaller, more carefully curated datasets for AI training. However, many historians will likely share the skepticism of some researchers concerning such mitigations (). Bias emerges not just from explicit language or imagery but from the very structures of societies. Can any historical source be separated from its context as a neutral artifact, free of its creator's perspective and the influences of its time? What about the untold millions of sources that make up the scale of an LLM's training set?
Historians should contribute to the broader dialogue about the implications of these technologies, especially as they become increasingly embedded in our digital lives. Yet, such flaws do not mean LLMs have no place in the historian's toolkit. In fact, by acknowledging and confronting these shortcomings, historians can better contribute our disciplinary perspective on the debates concerning this technology – particularly in leveraging the strengths of these models to empower and broaden accessibility. The case studies below demonstrate how foundational AI models can serve as a versatile tool for both researching and communicating the past.
Case Study: Oral History Transcriptions¶
A promising application of LLMs and other foundational AI models is data preparation and cleanup. A general rule of thumb is that 80% of the labor involved in data analysis is dedicated to preparing the data (, ix). AI models hold significant potential to streamline and accelerate the challenging work of creating "tidy datasets" ().
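What a "tidy" dataset means in practice can be shown with a minimal, invented example: a "wide" table with one column per year is reshaped so that each row records exactly one observation, the form most analysis tools expect. The newspapers and counts below are hypothetical stand-ins.

```python
# Hypothetical "wide" records: one row per newspaper, one column per year.
wide = [
    {"paper": "Die Lotse", "1944": 12, "1945": 7},
    {"paper": "Der Ruf",   "1944": 20, "1945": 15},
]

# Tidy ("long") form: one row per paper-year observation.
tidy = [
    {"paper": row["paper"], "year": int(year), "issues": count}
    for row in wide
    for year, count in row.items()
    if year != "paper"
]

print(tidy[0])  # → {'paper': 'Die Lotse', 'year': 1944, 'issues': 12}
```

Much of the drudgery AI assistance promises to reduce is exactly this kind of reshaping, renaming, and normalizing, repeated across thousands of records.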
Oral history provides a particularly useful case study for demonstrating the potential value of AI models. Transcription of audio recordings is a central activity in this methodology, but transcriptions often require considerable expense, labor, and time. However, advances in machine learning models have resulted in impressive gains in streamlining this task. Notable among these models is Whisper, an open-source audio transcription model developed by OpenAI that belongs to the same Transformer family as the GPT series (). Let's test Whisper on the first two minutes of a transcribed oral history of historian John Hope Franklin by the Southern Oral History Program ().
```python
from IPython.display import Audio

file_path = "media/A-0339_edited.mp3"
Audio(file_path)
```

OpenAI released Whisper as a series of open-source machine learning models, freely available and hosted on Hugging Face. () However, for simplicity, this demonstration code uses OpenAI's API for Whisper. As of March 2023, OpenAI charged $0.36 per hour of recorded audio for transcriptions using the API.
```python
# Below are the modules and libraries needed for running the case studies below.
# Requires Python version 3.8
!pip install pandas
!pip install Pillow
!pip install openai[embeddings]
!pip install langchain
!pip install matplotlib seaborn
```

```python
# Below is the transcript for the first 1:30 of the following oral history:
# Interview with John Hope Franklin by John Egerton, July 27, 1990. Interview A-0339,
# Collection #4007. Southern Oral History Program Collection, Southern Historical Collection,
# Wilson Library, University of North Carolina at Chapel Hill.
# https://docsouth.unc.edu/sohp/A-0339/menu.html
original_transcript = "JOHN EGERTON: I know your historical, personal background, about your parents meeting at Walden. You know, we talked. . . .\n\nJOHN HOPE FRANKLIN: At Roger Williams.\n\nJOHN EGERTON: At Roger Williams. We've talked about that before, and about how you got to Nashville from Oklahoma and all that. But I want to kind of pick up about the time when you were an undergraduate at Fisk in the '30s, and ask you first, well, a couple of things. One, do you recall any meeting, interracial meetings, that took place on the Vanderbilt campus during those years?\n\nJOHN HOPE FRANKLIN: No.\n\nJOHN EGERTON: Never happened?\n\nJOHN HOPE FRANKLIN: No, never happened so far as I know.\n\nJOHN EGERTON: At Fisk, yes, but at Vanderbilt, no?\n\nJOHN HOPE FRANKLIN: That's right.\n\nJOHN EGERTON: The people from Vanderbilt would come over there, but not the other way around?\n\nJOHN HOPE FRANKLIN: That's right. And I don't whether you remember the famous meeting—maybe then I would have to back up and say I know of one—where a number of people, distinguished sociologists, probably Robert Park, people like that. I'm not certain who they were. They had a meeting out at Vanderbilt and invited E. Franklin Frazier. It might even have been a luncheon. And I think Chancellor Kirkland learned about and simply blew his stack.\n\nJOHN EGERTON: This would have been in that period when you were an undergraduate.\n\nJOHN HOPE FRANKLIN: Yes. It would have been because, you see, Frazier left at the end of my junior year. Went to Harvard in 1934. Other incidents that I remember in Nashville and at Vanderbilt was when, in my senior year, the spring of my senior year, I was an applicant for admission to Harvard to go to graduate school. This is before the GRE's, you see. So they wanted me to take a scholastic Aptitude Test, and, of course, it was scheduled, like the GRE's, at a certain time and place. And it was at Vanderbilt, and it was in a certain room on Vanderbilt campus. I went there."

display(Markdown("Human Transcription:\n\n" + original_transcript))
```

```python
# This code transcribes the first 2:33 minutes of the interview above using OpenAI's Whisper API.
# Information about using Whisper can be found here: https://platform.openai.com/docs/guides/speech-to-text
import time

audio_file = open("media/A-0339_edited.mp3", "rb")
start_time = time.time()
whisper_output = openai.Audio.transcribe("whisper-1", audio_file)
end_time = time.time()
whisper_transcript = whisper_output['text']

# Whisper does not label speakers; the variable below adds speaker labels by hand
# to Whisper's output to facilitate comparison with the human transcript.
diarized_whisper_transcript = "JOHN EGERTON: What I'd like to do, I'd like to, I know, I know your historical personal background about your parents meeting at Walden and you know we've talked about that at Roger Williams, we've talked about that before and about how you got to Nashville from Oklahoma and all that, but I want to kind of pick up about the time when you were an undergraduate at Fisk in the 30s and ask you first a couple of things. One, do you recall any meetings, interracial meetings that took place on the Vanderbilt campus during those years?\n\nJOHN HOPE FRANKLIN: No.\n\nJOHN EGERTON: Never happened?\n\nJOHN HOPE FRANKLIN: No. Never happened so far as I know.\n\nJOHN EGERTON: At Fisk yes, but at Vanderbilt no.\n\nJOHN HOPE FRANKLIN: That's right.\n\nJOHN EGERTON: The people from Vanderbilt would come over there but not the other way around.\n\nJOHN HOPE FRANKLIN: That's right and I don't know whether you remember the famous meeting, maybe then I would have to back up and say I know of one, where a number of people, distinguished sociologists, probably Robert Park and people like that, I'm not certain who they were, they had a meeting out at Vanderbilt and invited E. Franklin Frazier out there. It might even have been a luncheon and I think Chancellor Kirkland learned about it.\n\nJOHN EGERTON: This would have been in that period when you were an undergraduate?\n\nJOHN HOPE FRANKLIN: It would have been because you see Frazier left at the end of my junior year when we went to Howard in 1934. The other instance that I remember of an international and at Vanderbilt was when in my senior year, the spring of my senior year, I was an applicant for admission to Harvard to go to graduate school. This is before the GRE. So they wanted to take a scholastic aptitude test. I and of course it was scheduled, like the GRE, at a certain time and place. At a certain place. And it was at Vanderbilt and it was in a certain room on Vanderbilt campus. And I went there."

automation_time = end_time - start_time
display(Markdown(f"Whisper Transcription time: {automation_time} seconds\n" + "\n\n" + "Whisper Transcript:\n\n" + whisper_transcript))
```

Whisper provides state-of-the-art transcriptions but, as of March 2023, does not perform speaker diarization. To facilitate text comparison, I have added diarization to Whisper's transcription in the 'diarized_whisper_transcript' variable. A variety of tutorials have been published for implementing speaker diarization, timestamping, and other features for Whisper. ()
```python
# Script for visualizing transcripts. Coded with assistance from GPT-4.
import difflib
from IPython.display import display, HTML

def highlight_char_diff(line1, line2):
    """Return HTML versions of two lines with character-level differences highlighted."""
    matcher = difflib.SequenceMatcher(None, line1, line2)
    html_line1 = ""
    html_line2 = ""
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            html_line1 += line1[i1:i2]
            html_line2 += line2[j1:j2]
        elif tag == "replace":
            html_line1 += f'<span style="background-color: #ffaaaa;">{line1[i1:i2]}</span>'
            html_line2 += f'<span style="background-color: #aaffaa;">{line2[j1:j2]}</span>'
        elif tag == "delete":
            html_line1 += f'<span style="background-color: #ffaaaa;">{line1[i1:i2]}</span>'
            html_line2 += " " * (i2 - i1)
        elif tag == "insert":
            html_line1 += " " * (j2 - j1)
            html_line2 += f'<span style="background-color: #aaffaa;">{line2[j1:j2]}</span>'
    return html_line1, html_line2

def compare_transcripts_v2(transcript1, transcript2):
    """Build an HTML table showing a line-by-line diff of two transcripts."""
    differ = difflib.unified_diff(transcript1.splitlines(), transcript2.splitlines(), lineterm="")
    diff_table = "<table>"
    line_counter = 1
    for line in differ:
        if line.startswith("+"):
            _, highlighted_line = highlight_char_diff("", line[1:])
            diff_table += f'<tr><td style="text-align: right;">{line_counter}</td><td style="text-align: left;">{highlighted_line}</td></tr>'
        elif line.startswith("-"):
            highlighted_line, _ = highlight_char_diff(line[1:], "")
            diff_table += f'<tr><td style="text-align: right;">{line_counter}</td><td style="text-align: left;">{highlighted_line}</td></tr>'
        elif line.startswith("@@"):
            diff_table += f'<tr><td style="text-align: left; background-color: #e0e0e0;" colspan="2">{line}</td></tr>'
        else:
            diff_table += f'<tr><td style="text-align: right;">{line_counter}</td><td style="text-align: left;">{line}</td></tr>'
        line_counter += 1
    diff_table += "</table>"
    return diff_table

html_comparison_v2 = compare_transcripts_v2(original_transcript, diarized_whisper_transcript)
display(HTML('<p><strong>Differences between original transcript (red) vs. Whisper (green):</strong></p>'))
display(HTML(html_comparison_v2))
```

Several notable takeaways emerge from comparing the original transcript with Whisper's transcription. Whisper misses Franklin's interjection in section 1, omits part of the final sentence in section 11, and renders "Nashville" as "international" in section 13. It also offers some variations in punctuation in sections 6, 7, and 10, and transcribes common filler words that human transcribers usually omit.
Human review is still required to ensure a faithful transcription. Yet those reviewers will start with a remarkably accurate transcript produced quickly and at little cost. Applications like Whisper hold significant potential for oral historians, as they can dramatically improve the efficiency and cost-effectiveness of transcription workflows. As this technology continues to advance, it is likely that the accuracy of tools like Whisper will only improve, further enhancing their utility for oral historians and allowing them to focus more on the analysis and interpretation of these sources.
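Those gains can also be measured. A rough word-level error rate gives a quick sense of how much correction a Whisper draft needs; the sketch below uses Python's standard difflib with two short stand-in strings (this is a simplification of true word error rate, which also penalizes insertions via full edit distance).

```python
import difflib

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Approximate WER: share of reference words not matched in the hypothesis."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = difflib.SequenceMatcher(None, ref_words, hyp_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / len(ref_words)

# Short stand-ins for the full transcript variables used above.
reference = "I was an applicant for admission to Harvard"
hypothesis = "I was an applicant for admission to Howard"
print(round(word_error_rate(reference, hypothesis), 3))  # → 0.125
```

Tracking such a metric across recordings helps an oral history project decide where scarce human review time is best spent.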
Case Study: Error Correction of Optical Character Recognition Scans¶
Another potential use case of AI models for digital historians is the error correction of optical character recognition (OCR) scans. Machine learning techniques, such as those pioneered by the research team at Transkribus, have greatly enhanced the quality, speed, and cost of OCR scans across a broad range of historical texts. () However, even high-fidelity OCR outputs have error rates that insidiously impact the accessibility and searchability of text collections. () For instance, the image below comes from a newspaper published in a German prisoner-of-war camp in Mississippi during World War II and later microfilmed by the Library of Congress. Let's compare an OCR scan of this image via Google's Cloud Vision OCR service with a human transcription of the same text.
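The searchability problem is easy to demonstrate: an exact keyword search silently misses OCR-mangled words, while a fuzzy match can often recover them. A minimal sketch with Python's standard library, using an invented one-line corpus containing a typical OCR error:

```python
import difflib

# A tiny stand-in corpus containing an OCR error: "koonnen" for "koennen".
ocr_text = "Diese Worte koonnen ueber jedem Lager gestanden haben"

query = "koennen"

# Exact search misses the mangled word entirely.
exact_hit = query in ocr_text.split()

# Fuzzy matching recovers it despite the misread character.
fuzzy_hits = difflib.get_close_matches(query, ocr_text.split(), n=1, cutoff=0.8)

print(exact_hit, fuzzy_hits)  # → False ['koonnen']
```

Multiply this single miss across millions of scanned pages and the stakes of OCR correction for full-text search become clear.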
# Source: "Nur ein Film?." Die Lotse (Camp McCain, Mississippi), 30 June 1945. In: Karl John Richard Arndt, editor. German P.O.W. camp papers. Washington, D.C.: Library of Congress Photoduplication Service, 1965. Reel 9.
from PIL import Image

image = Image.open('media/die_lotse_6-30-45_1.png')
new_width = 600
new_height = int(image.height * (new_width / image.width))

# Resize the image
resized_image = image.resize((new_width, new_height), Image.LANCZOS)

# Display the resized image with its source metadata
metadata = {"jdh": {"object": {"type": "image", "source": ["Nur ein Film?. *Die Lotse* (Camp McCain, Mississippi), 30 June 1945. In: Karl John Richard Arndt, editor. German P.O.W. Camp Papers. Washington, D.C.: Library of Congress Photoduplication Service, 1965. Reel 9."]}}}
display(resized_image, metadata=metadata)

# This script compares an OCR output of the image above with a human transcription.
# Words in red come from the OCR output, words in green from the human transcription.
import difflib
from IPython.display import display, HTML

ocr_output_1 = 'NUR EIN FILM? NUR EIN\nUnverloeschbar tief haben sich uns die Bilder des Grauens einge- praegt, die jeder von uns dieser Tage in dem ersten amerikanischen Armeefilm aus Deutschland sah. Er schuetterung und Ent setzen haben Jeden Fuehlenden verstummen las sen, aber die Unmenschlichkeit, von "Deutschen" auf deutschem Bo- den begangen, lassst den Gesitte- ten nicht schweigend darueberhin- gehen.\nDante setzt in seinem Werk: 1 "Die goettliche Komoedie" ueber den Eingang zur Hoelle die Worte: "Lasst fahren alle Hoffnungen 1hr, die ihr hier eintritt."\nDiese Worte koonnen ueber je- dem K.Z.-Lager Deutschlands ge standen haben; denn die Bilder des Schreckens und Grauens, wie sie Dante von der Hoelle entwirft, ver- blassen vor dieser schaurigen Wirklichkeit, die sich hier auf Er den unter lebenden Menschen im Herzen Europas abspielte. Was wir sahen, war dabei wohl nur ein kleiner Ausschnitt, wenn wir beden- ken, dass diese Tragoedie seit 1933 unzaehlige Opfer forderte.'

human_corrected_output_1 = 'NUR EIN FILM?\nUnverloeschbar tief haben sich uns die Bilder des Grauens eingepraegt, die jeder von uns dieser Tage in dem ersten amerikanischen Armeefilm aus Deutschland sah. Erschuetterung und Entsetzen haben jeden Fuehlenden verstummen lassen, aber die Unmenschlichkeit, von "Deutschen" auf deutschem Boden begangen, laesst den Gesitteten nicht schweigend darueberhingehen.\nDante setzt in seinem Werk: "Die goettliche Komoedie" ueber den Eingang zur Hoelle die Worte: "Lasst fahren alle Hoffnungen ihr, die ihr hier eintritt."\nDiese Worte koennen ueber jedem K.Z.-Lager Deutschlands gestanden haben; denn die Bilder des Schreckens und Grauens, wie sie Dante von der Hoelle entwirft, verblassen vor dieser schaurigen Wirklichkeit, die sich hier auf Erden unter lebenden Menschen im Herzen Europas abspielte. Was wir sahen, war dabei wohl nur ein kleiner Ausschnitt, wenn wir bedenken, dass diese Tragoedie seit 1933 unzaehlige Opfer forderte.'

differ = difflib.Differ()
diff1 = list(differ.compare(ocr_output_1.split(), human_corrected_output_1.split()))

def ocr1_vs_human_1(diff1):
    result1 = []
    for word in diff1:
        if word.startswith('+'):
            result1.append(f'<span style="color:green;background-color:#e6ffe6;">{word[2:]}</span>')
        elif word.startswith('-'):
            result1.append(f'<span style="color:red;background-color:#ffe6e6;">{word[2:]}</span>')
        elif word.startswith(' '):
            result1.append(word[2:])
    return ' '.join(result1)

colored_diff_1 = ocr1_vs_human_1(diff1)
display(HTML(f'<p><strong>Differences between OCR Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_1}</p>'))

While the image quality is satisfactory and the text is printed using modern typefaces, the OCR still generates errors requiring human correction. Correcting such errors necessitates substantial review and intervention, representing significant labor when processing a sizable text corpus. However, LLMs can expedite this correction process when guided by a carefully designed prompt. This practice, often referred to as "prompt engineering," is a method used to direct LLMs in completing specific tasks. Details concerning each prompt and the methods behind it are described in the hermeneutical layer.
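The scale of that correction labor can be estimated programmatically. A minimal sketch with difflib that counts word-level discrepancies between an OCR output and its correction (the strings below are invented stand-ins echoing the newspaper text above):

```python
import difflib

def word_error_count(ocr, corrected):
    """Count word-level insertions and deletions flagged by difflib."""
    diff = difflib.Differ().compare(ocr.split(), corrected.split())
    return sum(1 for token in diff if token.startswith(('+', '-')))

# Invented stand-ins for an OCR line and its human correction.
ocr = "Er schuetterung und Ent setzen haben Jeden Fuehlenden"
fixed = "Erschuetterung und Entsetzen haben jeden Fuehlenden"
print(word_error_count(ocr, fixed))
```

Dividing such a count by the word total of the reference gives a rough word error rate for a scanned page.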
Let's now see how GPT-4 interprets this prompt, and compare the generated corrections with the initial OCR scan.
# Prompt 1: OCR Correction
ocr_prompt = """You are an AI research assistant with a specialty in correcting errors in OCR scans of newspaper images. In the following task you will be given an OCR'd text, and you will generate corrections using the following Task Format. Follow the instructions of the Format step-by-step.\n\nTask Format\n1. Examine the Examples: Examine the two examples of sample OCR generations, Sample Text 1 and Sample Text 2. \n2. Examine the Corrections: Examine the two examples of OCR corrections given, Corrected Transcription 1 and Corrected Transcription 2. Compare the changes between the Samples to the Corrected Transcriptions.\n3. Note Formatting in the Corrected Transcriptions: Within each corrected transcription are symbols to represent uncertainty or substantial edits to the original OCR. These are inserted to communicate to the user which words may need additional human review. Words that you are very uncertain about are bracketed with a \***\ before and after the very uncertain word. \n4. Examine New OCR Generation: You will then be given a new OCR generation.\n5. Generate New Corrected Transcription: Based on the examples and the prompt instructions, compose a New Corrected Transcription based on the New OCR Generation. Do your best to make it as accurate as possible. Do not correct the grammar or wording of the text, only seek to correct errors in the OCR. Likewise do not add umlauts, eszetts, or other diacritics. \n\nLet's begin.\n\nSample Text 1\n\n#begin sample text 1\n\nEIN LLERRUNDGANG\n\nit diosom Boitreg wird der Boricht des\nOborfoldwebels ltstaodt uobor seinen Rund-\ngang durchs Lagor c.bieschlosson.\nIm Juli bozogon dig orston PoWs ihra "zwungsheimat" im Camp I. wio sah es\ndesnals in diesom Lagor aus ? Bcreckon vurde:: aufgebaut; \n\n#end sample text 1\n\nHere is Sample Correction 1. \n\n#begin sample correction 1\n\nEIN LAGERRUNDGANG\n\nMit diesem Beitrag wird der Bericht des Oberfeldwebels Altstaedt ueber seinen Rundgang durchs Lager abgeschlossen.\nIm Juli bezogen die ersten PoWs ihre "Zwangsheimat" im Camp I. Wie sah es damals in diesem Lager aus? Barecken wurden aufgebaut\n#end sample correction 1\n\nHere is Sample Text 2:\n\n#begin sample text 2\n\nDem crsten bericut, der den kunagang duro:\ndas dritte bateillon schilcerte, lesson wir\nheuto den ueber das zwcito bataillon folgen.\nDie Schriftl.\n\nim bingang zum zweiten Lataillon ruht der blick auf dor Lagerstresse, did\nscharf ansteigt Lirks und rechts zichon sich sauber geinauerte Graeben entlane\n#end sample text 2\n\nHere is Sample Correction 2:\n\n#begin sample correction 2\nDem ersten ***Bericht***, der den ***Aufmarsch***\ndes dritten Bataillons schilderte, lessen wir \nheute der Ueber das zweite Bataillon. \nDie Schriftl. \n\nAm Anfang des zweiten Bataillons ruht der Blick auf der Lagerstrasse, die \nscharf ansteigt. Links und rechts sich sauber eingegrabene Graeben entlang.\n\n#end sample correction 2\n\nHere is the New OCR Generation:\n\n#begin new corrected transcription without umlauts, eszetts, or other diacritics\n"""

Prompt engineering utilizes the emergent capabilities of LLMs to solve complex tasks. This prompt employs two common prompting methods:
Few-shot learning: In this prompt, two sample OCR generations (Sample Text 1 and Sample Text 2) are provided, along with their corresponding corrected transcriptions (Corrected Transcription 1 and Corrected Transcription 2). These examples serve as a learning basis for the model to emulate the task and perform it on a new OCR generation. This technique allows the AI to generalize from the provided examples and apply its insights to new instances. ()
Chain-of-thought reasoning: The prompt guides the model through a series of steps to complete the task. It starts with examining the sample texts and their corrections, noting the formatting in the corrected transcriptions, and then proceeds to work on a new OCR generation. It is theorized that this structured approach helps LLMs better determine the desired output, which in this case is a more accurate transcription. ()
The versatility of these prompting methods enables the completion of a broad range of tasks, as does the ability to prompt LLMs using natural language instructions. Understanding and developing prompt approaches is an active area of research and experimentation. An excellent resource for exploring prompt engineering has been published by DAIR.AI (Democratizing Artificial Intelligence Research, Education, and Technologies). ()
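The few-shot pattern described above can also be assembled programmatically rather than pasted by hand. A minimal sketch (`build_few_shot_prompt` and its inputs are illustrative, not from the article's code):

```python
def build_few_shot_prompt(instructions, examples, new_input):
    """Assemble instructions, worked examples, and the new task into a
    single prompt string, mirroring the few-shot pattern of Prompt 1."""
    parts = [instructions]
    for i, (sample, correction) in enumerate(examples, start=1):
        parts.append(f"Sample Text {i}:\n{sample}")
        parts.append(f"Sample Correction {i}:\n{correction}")
    parts.append(f"New OCR Generation:\n{new_input}")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "Correct OCR errors. Do not change wording or add diacritics.",
    [("EIN LLERRUNDGANG", "EIN LAGERRUNDGANG")],
    "Die Lagerzeitung ist nun erschienen.",
)
print(prompt)
```

Keeping the examples in a list makes it easy to swap in samples drawn from the same collection as the new text.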
Prompt 1: OCR Correction
You are an AI research assistant with a specialty in correcting errors in OCR scans of newspaper images. In the following task you will be given an OCR’d text, and you will generate corrections using the following Task Format. Follow the instructions of the Format step-by-step.
Task Format
Examine the Examples: Examine the two examples of sample OCR generations, Sample Text 1 and Sample Text 2.
Examine the Corrections: Examine the two examples of OCR corrections given, Corrected Transcription 1 and Corrected Transcription 2. Compare the changes between the Samples to the Corrected Transcriptions.
Note Formatting in the Corrected Transcriptions: Within each corrected transcription are symbols to represent uncertainty or substantial edits to the original OCR. These are inserted to communicate to the user which words may need additional human review. Words that you are very uncertain about are bracketed with a *** before and after the very uncertain word.
Examine New OCR Generation: You will then be given a new OCR generation.
Generate New Corrected Transcription: Based on the examples and the prompt instructions, compose a New Corrected Transcription based on the New OCR Generation. Do your best to make it as accurate as possible. Do not correct the grammar or wording of the text, only seek to correct errors in the OCR. Likewise do not add umlauts, eszetts, or other diacritics.
Let’s begin.
Sample Text 1
#begin sample text 1
EIN LLERRUNDGANG
it diosom Boitreg wird der Boricht des Oborfoldwebels ltstaodt uobor seinen Rund- gang durchs Lagor c.bieschlosson. Im Juli bozogon dig orston PoWs ihra “zwungsheimat” im Camp I. wio sah es desnals in diesom Lagor aus ? Bcreckon vurde:: aufgebaut;
#end sample text 1
Here is Sample Correction 1.
#begin sample correction 1
EIN LAGERRUNDGANG
Mit diesem Beitrag wird der Bericht des Oberfeldwebels Altstaedt ueber seinen Rundgang durchs Lager abgeschlossen. Im Juli bezogen die ersten PoWs ihre “Zwangsheimat” im Camp I. Wie sah es damals in diesem Lager aus? Barecken wurden aufgebaut
#end sample correction 1
Here is Sample Text 2:
#begin sample text 2
Dem crsten bericut, der den kunagang duro: das dritte bateillon schilcerte, lesson wir heuto den ueber das zwcito bataillon folgen. Die Schriftl.
im bingang zum zweiten Lataillon ruht der blick auf dor Lagerstresse, did scharf ansteigt Lirks und rechts zichon sich sauber geinauerte Graeben entlane
#end sample text 2
Here is Sample Correction 2:
#begin sample correction 2
Dem ersten Bericht, der den Aufmarsch des dritten Bataillons schilderte, lessen wir heute der Ueber das zweite Bataillon. Die Schriftl.
Am Anfang des zweiten Bataillons ruht der Blick auf der Lagerstrasse, die scharf ansteigt. Links und rechts sich sauber eingegrabene Graeben entlang.
#end sample correction 2
Here is the New OCR Generation:
#begin new corrected transcription without umlauts, eszetts, or other diacritics
# OpenAI completion using the GPT-4 model with the OCR correction prompt.
query = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "assistant", "content": ocr_prompt},
        {"role": "user", "content": ocr_output_1}
    ]
)
gpt4_output_1 = query['choices'][0]['message']['content']

# Comparing GPT-4's output with the human transcription.
differ = difflib.Differ()
diff2 = list(differ.compare(gpt4_output_1.split(), human_corrected_output_1.split()))

def gpt4_vs_human_1(diff2):
    result = []
    for word in diff2:
        if word.startswith('+'):
            result.append(f'<span style="color:green;background-color:#e6ffe6;">{word[2:]}</span>')
        elif word.startswith('-'):
            result.append(f'<span style="color:red;background-color:#ffe6e6;">{word[2:]}</span>')
        elif word.startswith(' '):
            result.append(word[2:])
    return ' '.join(result)

colored_diff_2 = gpt4_vs_human_1(diff2)
display(HTML(f'<p><strong>Differences between GPT-4 Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_2}</p>'))
display(HTML(f'<p><strong>Differences between OCR Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_1}</p>'))

In comparing the two outputs against the human transcription, GPT-4 demonstrates a remarkable ability to correct the errors in the OCR scan. GPT-4's ability to indicate its uncertainty in its error corrections also speeds human review of its output. However, the quality of the initial OCR scan remains crucial for a successful output. For example, here is a lower-quality image containing 'noise' that causes substantial errors in the OCR output.
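The *** uncertainty convention from Prompt 1 also lends itself to programmatic review. A sketch for collecting flagged words (the sample string echoes Sample Correction 2; `flag_uncertain` is an illustrative helper, not part of the article's code):

```python
import re

def flag_uncertain(transcription):
    """Collect words the model bracketed with *** for human review."""
    return re.findall(r'\*\*\*(.+?)\*\*\*', transcription)

# Hypothetical model output using the prompt's uncertainty convention.
sample = "Dem ersten ***Bericht***, der den ***Aufmarsch*** des dritten Bataillons schilderte"
print(flag_uncertain(sample))  # ['Bericht', 'Aufmarsch']
```

A reviewer could then jump straight to the flagged words rather than rereading the full transcription.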
# Source: "Zum Geleit." Die Lotse (Camp McCain, Mississippi), 15 March 1945. In: Karl John Richard Arndt, editor. German P.O.W. camp papers. Washington, D.C.: Library of Congress Photoduplication Service, 1965. Reel 9.
image = Image.open('media/die_lotse_3-15-45_1.png')
new_width = 600
new_height = int(image.height * (new_width / image.width))

# Resize the image
resized_image = image.resize((new_width, new_height), Image.LANCZOS)

# Display the resized image with its source metadata
metadata = {"jdh": {"object": {"type": "image", "source": ["Zum Geleit. *Die Lotse* (Camp McCain, Mississippi), 15 March 1945. In: Karl John Richard Arndt, editor. German P.O.W. Camp Papers. Washington, D.C.: Library of Congress Photoduplication Service, 1965. Reel 9."]}}}
display(resized_image, metadata=metadata)

# OpenAI completion using the GPT-4 model.
ocr_output_2 = "Zum Deleit:\nDie neue Lagerzeitung ist nun erschienen. Ja eis ist nun ehr eine unengoare Totesche reworden und we ate anime der Prisoner in stilien Stunden und in froler Laune ersonnon, ier findet ihr es schwarz auf weiss.\nUeber manches moschtet ihr nachdenken, ueber manches euch freuen, belaecheln koennt ihr aller, aber denkt iaren wie an es besser nachen koennte und seit mit Vorschlaegen nicht geizig und zurueckhaltend. Alles, or euch bewegt, arnstes und Heiteres, soll seinen Platz Tinden in dieren. Blaettern, nur Politik lasst ferne.\nWenn euch diese Zeitung Errunterung Unterhaltung und Anregung Ceben, so ist das Cer rchoenste Loin fuer die Nuehe aller, die um das Zustandekommen dieser Laerzeitung benueht war'n.\nwollen\nNoolimals, Jeder arbeite mit an diesen schoenen Werk, nach der Parole Alles von Prisoner fuer Prisoner wir die Zeitung fuehren.\nDas Erscheinen ist nonetlich zreimal vorgesehen. Einsendungen werden nach Hasnabe des verfuegberen Platzes aufgenommen, wobei kein besonders kritischer Kesesta oezue lich er kuenetlerischen Vollendun; an- Celest sird, inner in denkt daran sie viele Kameraden sure Geisteeprodukte lesen und wir doch eine Auerall treffen muessen.\nDie Sohriftleitung."

human_corrected_output_2 = "Zum Geleit:\nDie neue Lagerzeitung ist nun erschienen. Ja sie ist nunmehr eine unlengoare Tatsache geworden und was die Oshirne [?] der Prisoner in stillen Stunden und in froher Laune ersonnen, hier findet ihr es schwarz auf weiss.\nUeber manches moechtet ihr nachdenken, ueber manches euch freuen, belaecheln koennt ihr aller, aber denkt daran wie man es besser machen koennte und seit mit Vorschlaegen nicht geizig und zurueckhaltend. Alles, was euch bewegt, Ernstes und Heiteres, soll seinen Platz finden in diesen Blaettern, nur Politik lasst ferne.\nWenn euch diese Zeitung Ermunterung, Unterhaltung und Anregung geben, so ist das der schoenste Lohn fuer die Muehe aller, die um das Zustandekommen dieser Lagerzeitung bemueht war'n.\nNochmals, jeder arbeite mit an diesem schoenen Werk, nach der Parole “Alles von Prisoner fuer Prisoner” wollen wir die Zeitung fuehren.\nDas Erscheinen ist monatlich zweimal vorgesehen. Einsendungen werden nach Hasnabe des verfuegbaren Platzes aufgenommen, wobei kein besonders kritischer Massstab bezueglich der künstlerischen Vollendung angelegt wird, immerhin denkt daran sie viele Kameraden eure Geistesprodukte lesen und wir doch eine Auswahl treffen muessen.\nDie Schriftleitung."

query = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "assistant", "content": ocr_prompt},
        {"role": "user", "content": ocr_output_2}
    ]
)
gpt4_output_2 = query['choices'][0]['message']['content']

# Comparing the raw OCR output with the human transcription.
differ = difflib.Differ()
diff2 = list(differ.compare(ocr_output_2.split(), human_corrected_output_2.split()))

def ocr2_vs_human_2(diff2):
    result2 = []
    for word in diff2:
        if word.startswith('+'):
            result2.append(f'<span style="color:green;background-color:#e6ffe6;">{word[2:]}</span>')
        elif word.startswith('-'):
            result2.append(f'<span style="color:red;background-color:#ffe6e6;">{word[2:]}</span>')
        elif word.startswith(' '):
            result2.append(word[2:])
    return ' '.join(result2)

colored_diff_3 = ocr2_vs_human_2(diff2)

# Comparing GPT-4's output with the human transcription.
differ = difflib.Differ()
diff3 = list(differ.compare(gpt4_output_2.split(), human_corrected_output_2.split()))

def gpt4_vs_human_2(diff3):
    # The source cell is cut off at this point; the body below is completed
    # to mirror gpt4_vs_human_1 above.
    result_3 = []
    for word in diff3:
        if word.startswith('+'):
            result_3.append(f'<span style="color:green;background-color:#e6ffe6;">{word[2:]}</span>')
        elif word.startswith('-'):
            result_3.append(f'<span style="color:red;background-color:#ffe6e6;">{word[2:]}</span>')
        elif word.startswith(' '):
            result_3.append(word[2:])
    return ' '.join(result_3)

colored_diff_4 = gpt4_vs_human_2(diff3)
display(HTML(f'<p><strong>Differences between GPT-4 Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_4}</p>'))
display(HTML(f'<p><strong>Differences between OCR Output (red) vs Human Transcription (green):</strong></p><p>{colored_diff_3}</p>'))

Here we see that GPT-4 achieved only modest improvements in the OCR output, perhaps an indication of the limits of this approach. However, GPT-4's multimodal nature may open new opportunities in the future. GPT-4's multimodal interface will soon allow it to directly perform prompted OCR scans of images. () Only time will tell whether these abilities surpass existing OCR techniques. Yet there seems to be remarkable potential for further exploration.
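One way to make "modest improvements" concrete is a word-level accuracy score against the human transcription. A sketch with difflib (the strings below are invented stand-ins, not the article's measurements):

```python
import difflib

def word_accuracy(candidate, reference):
    """Word-level similarity ratio against a human reference (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, candidate.split(), reference.split()).ratio()

# Invented strings: the model output fixes an OCR error the raw scan contains.
reference = "Die neue Lagerzeitung ist nun erschienen"
raw_ocr = "Die neue Lagerzeitung ist nun ersohienen"
model_fix = "Die neue Lagerzeitung ist nun erschienen"
print(word_accuracy(raw_ocr, reference) <= word_accuracy(model_fix, reference))  # True
```

Scoring both the raw scan and the model's correction against the same reference quantifies how much (or little) the model actually improved the page.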
These two case studies demonstrate an LLM's capacity to assist in various forms of data cleanup and preparation. While human review remains essential, LLMs can make that review less time-consuming and labor-intensive. LLMs are already being employed for tasks as varied as text normalization, metadata generation, automated summarization, date extraction and standardization, sentiment analysis, relationship extraction, and named entity recognition. Further experimentation will undoubtedly reveal additional use cases. Such approaches can improve the accuracy, lower the costs, and accelerate the pace of data preparation. LLMs can also expand access to historical sources by enabling the use of programmatic techniques via natural language instructions.
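As one illustration of the date extraction and standardization mentioned above, here is a small sketch using only the standard library (the pattern and sample string are illustrative, not part of the article's pipeline):

```python
import re
from datetime import datetime

def standardize_dates(text):
    """Find dates like '30 June 1945' and rewrite them as ISO 8601."""
    pattern = (r'\b(\d{1,2}) (January|February|March|April|May|June|July|'
               r'August|September|October|November|December) (\d{4})\b')
    def to_iso(match):
        return datetime.strptime(match.group(0), "%d %B %Y").strftime("%Y-%m-%d")
    return re.sub(pattern, to_iso, text)

print(standardize_dates("Die Lotse, 30 June 1945, Camp McCain"))
# Die Lotse, 1945-06-30, Camp McCain
```

Normalized dates make chronological sorting and filtering of a corpus trivial.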
Case Study: Ask-A-Source - Retrieval Based Methods for LLMs¶
While LLMs demonstrate a broad range of capabilities for data cleanup, their tendency toward 'hallucinations' represents a formidable obstacle to their use in historical research and analysis. However, recent advances in retrieval-based methods offer the potential to ground LLMs in greater factual accuracy. () Such techniques also enable the use of LLMs to analyze large text collections, search the Internet, and utilize external tools to solve problems in unfamiliar knowledge domains. The following case study demonstrates one such approach for historians: using an LLM to answer questions about a historical source.
One of the shortcomings of LLMs is their context length, or the hard limit on how much text they can interpret in a single query. Models like GPT-3 and ChatGPT can only process around two to three pages of text in a single query, while GPT-4 possesses a much larger context length. () Yet even the most advanced models cannot directly interpret long-form texts or large text collections in a single query.
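That constraint can be checked programmatically before a text is ever sent to a model. A rough sketch, assuming the common rule of thumb of roughly four characters per token (an approximation, not an exact tokenizer):

```python
def fits_context(text, context_tokens, chars_per_token=4):
    """Rough check that a text fits a model's context window, using the
    ~4-characters-per-token rule of thumb (an approximation only)."""
    return len(text) / chars_per_token <= context_tokens

page = "x" * 3000   # roughly one printed page of text (~750 tokens)
print(fits_context(page * 3, 4096))    # True: a few pages fit a ChatGPT-sized window
print(fits_context(page * 60, 8192))   # False: a book-length text does not
```

A real pipeline would use the model's own tokenizer for an exact count, but even this estimate flags texts that must be split before querying.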
These limits can, however, be circumvented through the use of tools like semantic search and prompt chaining. In this case study, two different AI models will work together to answer questions about a historical text, Thomas More's History of Richard III (). At the end of the process, GPT-4 will deliver a series of responses supported by direct quotations from the text.
The first step in this process is semantic search with OpenAI's Ada model, a computational technique for establishing text similarity. Let's pose the following question: "Who killed the princes in the Tower?"
Here are the sections of the text identified by Ada as the most semantically similar:
# Script for semantic search over an embedded text using OpenAI's Ada embedding model.
import pandas as pd
import numpy as np
from IPython.display import display, Markdown
from openai.embeddings_utils import get_embedding, cosine_similarity

question = "Who killed the princes in the Tower?"

# Computed embeddings of Thomas More's History of Richard III. Available in the article's GitHub repo:
datafile_path = "script/more_text_embedded.csv"
df = pd.read_csv(datafile_path)
df["embedding"] = df.embedding.apply(eval).apply(np.array)

def search_text(df, text, n=3, pprint=True):
    text_embedding = get_embedding(text, engine="text-embedding-ada-002")
    df["similarities"] = df.embedding.apply(lambda x: cosine_similarity(x, text_embedding))
    # Select the n most similar rows of the sorted DataFrame
    top_three = df.sort_values("similarities", ascending=False).head(n)
    # If `pprint` is True, display the output
    if pprint:
        for i, (_, row) in enumerate(top_three.iterrows(), 1):
            display(Markdown(f"**Result {i} (Similarity: {row['similarities']:.4f}):**\n\n{row['combined']}\n"))
    # Return the DataFrame with the added similarity values
    return top_three

# Call the search_text() function and store the return value in a variable
results_df = search_text(df, question, n=3)

# Reset the index and create a new column "index"
results_df = results_df.reset_index()

# Access the values in the "similarities" and "combined" columns
similarity1 = results_df.iloc[0]["similarities"]
combined1 = str(results_df.iloc[0]["combined"])
similarity2 = results_df.iloc[1]["similarities"]
combined2 = str(results_df.iloc[1]["combined"])
similarity3 = results_df.iloc[2]["similarities"]
combined3 = str(results_df.iloc[2]["combined"])

To enable semantic search in More's text, it must be converted to text embeddings, an approach for transforming "unstructured text data into a structured form." (("Text Embeddings Visually Explained" 2022)) Preparing a text for semantic search depends on the intended use case and requires consideration of the model's context length.
In this example, the text was broken down to the paragraph level and accompanied by a summary initially generated by ChatGPT (GPT-3.5) and edited by the author afterward. OpenAI's Ada model was then used to compute a set of searchable embeddings. Both the original data file and computed embeddings are included in the Github repo for this article. See the OpenAI Cookbook for step-by-step code examples of how to generate embeddings for texts. ((OpenAI Cookbook 2023)) A variety of other semantic search platforms are available, such as Pinecone, Weaviate, and Haystack.
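Under the hood, ranking embedded sections comes down to cosine similarity. A pure-Python sketch with toy three-dimensional vectors (real text-embedding-ada-002 vectors have 1,536 dimensions; these numbers are invented):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors, the measure used to rank embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Invented toy "embeddings" for a question and two candidate sections.
question_vec = [0.9, 0.1, 0.0]
relevant_vec = [0.8, 0.2, 0.1]
unrelated_vec = [0.0, 0.1, 0.9]
print(cosine_sim(question_vec, relevant_vec) > cosine_sim(question_vec, unrelated_vec))  # True
```

The section whose vector points in nearly the same direction as the question's vector scores highest, regardless of vector length.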
This student edition of More's History of Richard III is produced by the Thomas More Society. () My thanks to Dr. Ian Crowe, director of the Thomas More Program at Belmont Abbey College, for the opportunity to explore this text with his students in spring 2023 using this research approach.
The results from the semantic search provide the three sections of the text with the highest semantic similarity score. However, high semantic similarity does not always indicate relevance, and such searches can return false positives. Filtering these false positives is essential if you wish to attempt large-scale analysis of an entire text. We can use an LLM to determine the relevance of each text section and then filter out irrelevant sections before answering the question.
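A minimal sketch of the threshold-filtering step (the section names and scores below are hypothetical, echoing the style of the search results; in the article's pipeline an LLM then vets the survivors for relevance):

```python
def filter_sections(scored_sections, threshold=0.90):
    """Keep only sections whose semantic similarity clears the threshold."""
    return [name for name, score in scored_sections if score >= threshold]

# Hypothetical similarity scores for three retrieved sections.
scores = [("Section_160", 0.896), ("Section_3", 0.928), ("Section_113", 0.869)]
print(filter_sections(scores))  # ['Section_3']
```

A fixed threshold is a blunt first pass; that is precisely why the next step hands the surviving sections to the LLM for a reasoned relevance judgment.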
In making multiple queries to the LLM for a single task, we will use a technique known as prompt chaining. Prompt chains break down complex tasks into smaller, more manageable components by making a series of queries to the LLM. We'll use the langchain library for creating these chains. ()
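The shape of such a chain can be sketched with stand-ins for the model calls (`stub_relevance` and `stub_answer` below are placeholders, not langchain APIs; a real chain would query the LLM at each link):

```python
def stub_relevance(question, section):
    """Stand-in for the LLM relevance check (no API access in this sketch)."""
    return "Relevant" if "princes" in section else "Irrelevant"

def stub_answer(question, sections):
    """Stand-in for the LLM answer-composition step."""
    return f"Answer to '{question}' from {len(sections)} section(s)"

def prompt_chain(question, sections):
    """Link 1 filters sections for relevance; link 2 answers from the survivors."""
    kept = [s for s in sections if stub_relevance(question, s) == "Relevant"]
    return stub_answer(question, kept)

result = prompt_chain("Who killed the princes?",
                      ["the princes were smothered", "a sermon at Paul's Cross"])
print(result)  # Answer to 'Who killed the princes?' from 1 section(s)
```

Breaking the task into links means each LLM call has one small job, which makes failures easier to localize and prompts easier to tune.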
The first link in the chain is a relevance check. In this sequence, GPT-4 will be prompted with a detailed set of instructions, along with three examples of how to determine relevance. Then each text section identified in the semantic search will be passed to the LLM. GPT-4 will use the prompt to generate an analysis of each section's relevance.
langchain is a Python library for programming with large language models. It is maintained by a highly active open-source community and supported by extensive documentation.
# Code for Text Relevance Prompt using langchain
from langchain.prompts import PromptTemplate
from langchain.prompts import FewShotPromptTemplate

# Few-shot examples to enable in-context learning for GPT-4.
burial_question = "2. Section: Summary: Section_160: Sir James had the murderers bury King Edward V and Prince Richard's bodies deep in the ground under a heap of stones. Text: Section_160: Which after that the wretches perceived, first by the struggling with the pains of death, and after long lying still, to be thoroughly dead, they laid their bodies naked out upon the bed, and fetched Sir James to see them. Who, upon the sight of them, caused those murderers to bury them at the stair-foot, suitably deep in the ground, under a great heap of stones..\n3. SSS: 0.896\n4.Key Words: Edward V, Prince Richard, bodies, bury, Sir James\n5. Background knowledge and context: Edward V was one of the sons of King Edward IV and Prince Richard was his brother. Sir James was involved in their deaths and had their bodies buried.\n6.Relevance Determination: Medium\n7. Relevance Explanation: The key words 'Edward V' and 'Prince Richard' are related to the question as they are mentioned in the same sentence as 'bury'. However, the question specifically asks about the burial of Edward IV, not Edward V and Prince Richard.\n8Final Output: Section_160: Irrelevant.\nExcellent. Let's try another."

cecily_question = "2. Section: Summary: Section_113: In a sermon at Paul's Cross, it was revealed to the people that King Edward IV's marriage was not lawful, and that his children were bastards. Text: Section_113: Now then as I began to show you, it was by the Protector and his council concluded that this Doctor Shaa should in a sermon at Paul's Cross signify to the people that neither King Edward himself nor the Duke of Clarence were lawfully begotten, nor were the very children of the Duke of York, but gotten unlawfully by other persons by the adultery of the Duchess, their mother, and that also Dame Elizabeth Lucy was verily the wife of King Edward, and so the Prince and all his children were bastards that were gotten upon the Queen.\n3. SSS: 0.869\nBased on the provided information, it appears that the section is potentially relevant to the question. The semantic similarity score is relatively high, indicating that there may be some connection between the section and the question. However, it is important to carefully examine the section and the question to determine the specific relevance.\n4.Key Words: The key words in the section that may be specifically and directly related to the question are 'King Edward,' 'Duke of Clarence,' 'Duke of York,' 'Elizabeth Lucy,' 'Prince,' and 'children.' These words refer to individuals or groups of people mentioned in the section.\n5. Background knowledge and context: Knowing that the question is asking about a person named Cecily, we can use our background knowledge about the context of the text to further assess the relevance of the section. The section mentions several individuals and groups of people, including King Edward, the Duke of Clarence, the Duke of York, Elizabeth Lucy, the Prince, and the children. Cecily is not mentioned by name in the section.\n6.Relevance Determination: Based on the key words identified in the section and our background knowledge of the context, it is unlikely that the section is relevant to the question. The section does not mention the name Cecily and does not provide any information about her. Therefore, I have a low degree of confidence in determining that the section is relevant to the question.\n7.Relevance Explanation: The section is not relevant to the question because it does not mention the name Cecily and does not provide any information about her.\n8.Final Output: Section_113: Irrelevant.\nExcellent. Let's try another."

edward_question = "2. Section: Summary: Section_3: King Edward IV was a good-looking and strong man who was wise in counsel and just in war. He was also known for his love of women and good food. However, he was also known to be a fair and merciful man, and he was greatly loved by his people. Text: Section_3: He was a goodly personage, and very princely to behold: of heart, courageous; politic in counsel; in adversity nothing abashed; in prosperity, rather joyful than proud; in peace, just and merciful; in war, sharp and fierce; in the field, bold and hardy, and nevertheless, no further than wisdom would, adventurous. Whose wars whosoever would well consider, he shall no less commend his wisdom when he withdrew than his manhood when he vanquished. He was of visage lovely, of body mighty, strong, and clean made; however, in his latter days with over-liberal diet, he became somewhat corpulent and burly, and nonetheless not uncomely; he was of youth greatly given to fleshly wantonness, from which health of body in great prosperity and fortune, without a special grace, hardly refrains. This fault not greatly grieved the people, for one man's pleasure could not stretch and extend to the displeasure of very many, and the fault was without violence, and besides that, in his latter days, it lessened and well left.\n3. SSS: 0.928\nTo determine whether this section is relevant to the question, let's follow the steps of the Method:1.Question: The user's question is ‘What was King Edward IV's appearance?’\n2.Section: The given section is about King Edward IV's appearance, character, and behavior.\n3. SSS: The semantic similarity score (SSS) is 0.928, which is above the threshold of .90 and indicates that there is some potential relevance between the section and the question.\n4. Key Words: Key words in the section that are directly and specifically related to the question include ‘goodly personage,’ ‘visage lovely,’ ‘body mighty, strong, and clean made,’ and ‘somewhat corpulent and burly.’ These words directly describe King Edward IV's appearance.\n5. Background Knowledge: Based on my background knowledge of the subject matter, I can confirm that this section is directly and specifically relevant to answering the question about King Edward IV's appearance.\n6. Relevance Determination: The relevance determination is high, as the section is directly and specifically related to the question.\n7. Relevance Explanation: The relevance explanation is that the section contains detailed descriptions of King Edward IV's appearance, including his physical appearance and any changes to it over time.\n8. Final Output: Therefore, the final output is ‘Section_3: Relevant.’\nExcellent. Let's try another."

# Formatting the examples to pass to the LLM.
examples = [
    {"question": "1. Question: Where was Edward IV buried?", "output": burial_question},
    {"question": "1. Question: What was Edward IV's appearance?", "output": edward_question},
    {"question": "1. Question: Who is Cecily?", "output": cecily_question}
]

# Template for rendering each example. The source cell is truncated at this call;
# these parameters are a plausible reconstruction of the intended template.
example_prompt = PromptTemplate(
    input_variables=["question", "output"],
    template="{question}\n{output}"
)

This prompt design employs few-shot learning and chain-of-thought prompting to guide the language model in determining the relevance of a given text section to a specific question. Let's analyze each aspect of the prompt design and its impact on the model's response.
Chain-of-thought prompting: The prompt design incorporates a set of instructions for step-by-step completion of the task. This method guides the model through the process of analyzing the text section and evaluating its relevance to the deaths of the Princes.
Few-shot learning: As in the previous example, a set of examples is offered to help guide the model's response. Here various sections of the text are compared with a user question for relevance. A scripted sequence is then provided, following the chain-of-thought instructions. Both irrelevant and relevant examples are used.
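Because each scripted response ends with a fixed 'Final Output' verdict, downstream code can parse the model's judgment mechanically. A sketch (the response string is a condensed, hypothetical model output in the format of the examples above):

```python
import re

def parse_relevance(model_output):
    """Extract the 'Section_x: Relevant/Irrelevant' verdict from a
    chain-of-thought response following the examples' format."""
    match = re.search(r'(Section_\d+):\s*(Relevant|Irrelevant)', model_output)
    return match.groups() if match else None

response = "8. Final Output: Therefore, the final output is Section_3: Relevant."
print(parse_relevance(response))  # ('Section_3', 'Relevant')
```

This is what lets the relevance link feed a clean list of section names into the next link of the chain.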
Prompt 2: Textual Relevance
You are an AI expert on the ‘History of Richard III’ by Thomas More. In this exercise you are given a user-supplied question, a Section of the Text, a Semantic Similarity Score, and a Method for determining the Section’s relevance to the Question. Your objective is to determine whether that Section of the text is directly and specifically relevant to the user question. You will use the Method below to fulfill this objective, taking it step by step.
Here is your Method.
Method: Go step by step in answering the question.
Question: You will be provided with a user question.
Section: You will be given a section of the text from Thomas More’s ‘The History of Richard III.’
Semantic Similarity Score: You are then given a semantic similarity score, which ranges from 1.0 (highest) to 0.0 (lowest). The higher the score, the more likely its potential relevance. Scores approaching .90 and above are generally considered to have some relevance. However, this score isn’t fully determinative, as other semantically related words in the Section can generate false positives. Weigh the value of this score alongside a careful examination of the Question and the Section.
Key Words: Identify key words in the Section that are specifically and directly related to the Question. Such key words could include specific locations, events, or people mentioned in the Section.
Background knowledge and context: Use your background knowledge of the subject matter to further elaborate on whether the Section is directly and specifically relevant to answering the Question.
Relevance Determination: Based on your review of the earlier steps in the Method, determine whether the Section is relevant, and gauge your confidence (high, medium, low, or none) in this determination. A high-confidence determination means the Section is specifically and directly related to the Question. If the Section is relevant and your confidence is high, write ‘Section_x: Relevant’. Otherwise, if the Section is not relevant or your confidence is less than high, write ‘Section_x: Irrelevant’.
Relevance Explanation: Based on your review in the earlier steps of the Method, explain why the Section is or is not relevant to the Question. Let’s begin.
Prompt 2: Examples for in-context learning.
Example 1:
Question: Who is Cecily?
Section:
Summary: Section_113: In a sermon at Paul's Cross, it was revealed to the people that King Edward IV's marriage was not lawful, and that his children were bastards.
Text: Section_113: Now then as I began to show you, it was by the Protector and his council concluded that this Doctor Shaa should in a sermon at Paul's Cross signify to the people that neither King Edward himself nor the Duke of Clarence were lawfully begotten, nor were the very children of the Duke of York, but gotten unlawfully by other persons by the adultery of the Duchess, their mother, and that also Dame Elizabeth Lucy was verily the wife of King Edward, and so the Prince and all his children were bastards that were gotten upon the Queen.
SSS: 0.869
Based on the provided information, it appears that the section is potentially relevant to the question. The semantic similarity score is relatively high, indicating that there may be some connection between the section and the question. However, it is important to carefully examine the section and the question to determine the specific relevance.
Key Words: The key words in the section that may be specifically and directly related to the question are 'King Edward,' 'Duke of Clarence,' 'Duke of York,' 'Elizabeth Lucy,' 'Prince,' and 'children.' These words refer to individuals or groups of people mentioned in the section.
Background knowledge and context: Knowing that the question is asking about a person named Cecily, we can use our background knowledge about the context of the text to further assess the relevance of the section. The section mentions several individuals and groups of people, including King Edward, the Duke of Clarence, the Duke of York, Elizabeth Lucy, the Prince, and the children. Cecily is not mentioned by name in the section.
Relevance Determination: Based on the key words identified in the section and our background knowledge of the context, it is unlikely that the section is relevant to the question. The section does not mention the name Cecily and does not provide any information about her. Therefore, I have a low degree of confidence in determining that the section is relevant to the question.
Relevance Explanation: The section is not relevant to the question because it does not mention the name Cecily and does not provide any information about her.
Final Output: Section_113: Irrelevant.
Excellent. Let's try another.
Example 2:
Edward Question: "What was King Edward IV's appearance?"
Section 3: Summary
King Edward IV was a good-looking and strong man who was wise in counsel and just in war. He was also known for his love of women and good food. However, he was also known to be a fair and merciful man, and he was greatly loved by his people.
Section 3: Text
King Edward IV was described as a goodly personage and very princely to behold. He was courageous at heart, politically astute, unshaken in adversity, and joyful in prosperity without being overly proud. In peace, he was just and merciful, while in war, he was sharp and fierce. He was bold and hardy in the field, but he didn't take unnecessary risks. Observers of his wars would commend his wisdom when he withdrew, just as they would praise his manhood when he emerged victorious.
Edward had a lovely visage, and his body was mighty, strong, and well-built. However, in his later years, he became somewhat corpulent and burly due to overindulgence in food, but he still maintained an attractive appearance. In his youth, he was greatly given to fleshly wantonness, a fault that the people did not hold against him since it didn't involve violence and affected only a few. Furthermore, this fault diminished and was eventually abandoned in his later years.
Semantic Similarity Score (SSS): 0.928
To determine the relevance of this section to the question, we can follow the steps of the Method:
- Question: The user's question is "What was King Edward IV's appearance?"
- Section: The given section is about King Edward IV's appearance, character, and behavior.
- SSS: The semantic similarity score (SSS) is 0.928, which is above the threshold of 0.90, indicating potential relevance between the section and the question.
- Key Words: Key words in the section that are directly and specifically related to the question include "goodly personage," "visage lovely," "body mighty, strong, and clean made," and "somewhat corpulent and burly." These words directly describe King Edward IV's appearance.
- Background Knowledge: Based on background knowledge of the subject matter, it can be confirmed that this section is directly and specifically relevant to answering the question about King Edward IV's appearance.
- Relevance Determination: The relevance determination is high, as the section is directly and specifically related to the question.
- Relevance Explanation: The relevance explanation is that the section contains detailed descriptions of King Edward IV's appearance, including his physical appearance and any changes to it over time.
Final Output: Therefore, the final output is "Section_3: Relevant."
Excellent. Let's try another.
```python
# Code for using the Text Relevance Prompt with GPT-4 via the langchain library.
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate, LLMChain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

system_prompt = SystemMessagePromptTemplate(prompt=relevance_prompt)
human_message_prompt_template = "Question: {question}\nKey Terms:"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_message_prompt_template)
chat_prompt = ChatPromptTemplate.from_messages([system_prompt, human_message_prompt])

chat = ChatOpenAI(temperature=0, model_name="gpt-4")
chain = LLMChain(llm=chat, prompt=chat_prompt)

# Run the relevance check for each of the three candidate sections
r_check_1 = chain.run(question=str(question + "\n2. Section:\n " + combined1 + "\n3. SSS: " + str(similarity1)))
r_check_2 = chain.run(question=str(question + "\n2. Section:\n " + combined2 + "\n3. SSS: " + str(similarity2)))
r_check_3 = chain.run(question=str(question + "\n2. Section:\n " + combined3 + "\n3. SSS: " + str(similarity3)))

display(Markdown("GPT-4's Determination of Relevance Starts at Step 4:\n\nRelevance Check 1: \n\n" + combined1 + "\n\n4. Key Terms: \n" + r_check_1 + "\n\n" + "Relevance Check 2: \n\n" + combined2 + "\n\n4. Key Terms: \n" + r_check_2 + "\n\n" + "Relevance Check 3: \n\n" + combined3 + "\n\n4. Key Terms: \n" + r_check_3 + "\n"))
```

GPT-4's analysis of the three text sections provides determinations of each section's relevance to the original question of "who killed the Princes in the Tower."
The model also provides the basis behind its determination, facilitating not just the ability to examine its analysis, but offering further data for the next link in the prompt chain.
In our next step, we'll filter out irrelevant sections and pass the remaining texts to the model to answer the initial question. We'll then prompt GPT-4 to identify a quotation from each text that supports its answer.
```python
# Example script demonstrating the use of regular expressions to filter out
# irrelevant text sections for the next part of the prompt chain. For this
# particular example, all three texts are relevant.
# Code designed with the assistance of GPT-3.
import pandas as pd
import re

# Combine sections and model outputs into a single dataframe
combined_df = pd.DataFrame(columns=['output', 'r_check'])
combined_df['output'] = [combined1, combined2, combined3]
combined_df['r_check'] = [r_check_1, r_check_2, r_check_3]

# Use the re.IGNORECASE flag to make the regular expression case-insensitive
regex = re.compile(r'(section_\d+:\srelevant)', re.IGNORECASE)

# Apply the regex pattern to the 'r_check' column and store the results in a new 'mask' column
combined_df['mask'] = combined_df['r_check'].str.extract(regex).get(0).notnull()

# Create a second mask to capture "this section is relevant"
combined_df['second_mask'] = combined_df['r_check'].str.contains(r'this section is relevant', flags=re.IGNORECASE)

# Combine the two masks using the bitwise OR operator (|)
combined_df['mask'] = combined_df['mask'] | combined_df['second_mask']

# Filter the dataframe to include only rows where the 'mask' column is True
relevant_df = combined_df.loc[combined_df['mask']].copy()

# Check if there are any rows in the relevant_df dataframe
if relevant_df.empty:
    print("No relevant sections identified.")
else:
    def combine_strings(row):
        return row['output'] + '\nKey Terms\n' + row['r_check']

    # Combine each section with its relevance analysis
    relevant_df['combined_string'] = relevant_df.apply(combine_strings, axis=1)
    final_sections = relevant_df['combined_string']
    #final_sections.to_csv('final_sections.csv')
    evidence_df = pd.DataFrame(final_sections)
    evidence = '\n\n'.join(evidence_df['combined_string'])

    # Keep only the 'output' column and convert it to a dictionary
    output_df = relevant_df[['output']]
    output_dict = output_df.to_dict('records')
```

```python
# Prompt for GPT-4 to identify quotes from the texts to support its answer.
windsor_analysis = "2. Summary: Section_1: King Edward IV was a beloved king who was interred at Windsor with great honor. He was especially beloved by the people at the time of his death. Text: Section_1: This noble prince died at his palace of Westminster and, with great funeral honor and heaviness of his people from thence conveyed, was interred at Windsor. He was a king of such governance and behavior in time of peace (for in war each part must needs be another's enemy) that there was never any prince of this land attaining the crown by battle so heartily beloved by the substance of the people, nor he himself so specially in any part of his life as at the time of his death.\n3. Initial Answer: King Edward IV was buried at Windsor with great honor and mourning from his people.\n4. Supporting Quote: ‘This noble prince died at his palace of Westminster and, with great funeral honor and heaviness of his people from thence conveyed, was interred at Windsor.’ (S.1)\n5. Combined Answer: King Edward IV was interred at Windsor with great honor and mourned by his people: ‘This noble prince...was interred at Windsor...and at the time of his death there was never any prince of this land attaining the crown by battle so heartily beloved by the substance of the people.’ (S.1)\nExcellent. Let’s try another."

wales_analysis = "2. Summary: Section_17: After King Edward IV's death, his son Prince Edward moved towards London. He was accompanied by Sir Anthony Woodville, Lord Rivers, and other members of the queen's family. Text: Section_17: As soon as the King was departed, that noble Prince his son drew toward London, who at the time of his father's death kept household at Ludlow in Wales. Such country, being far off from the law and recourse to justice, was begun to be far out of good will and had grown up wild with robbers and thieves walking at liberty uncorrected. And for this reason the Prince was, in the life of his father, sent thither, to the end that the authority of his presence should restrain evilly disposed persons from the boldness of their former outrages. To the governance and ordering of this young Prince, at his sending thither, was there appointed Sir Anthony Woodville, Lord Rivers and brother unto the Queen, a right honorable man, as valiant of hand as politic in counsel. Adjoined were there unto him others of the same party, and, in effect, every one as he was nearest of kin unto the Queen was so planted next about the Prince.\n3. Initial Answer: Wales is mentioned in the text as the place where Prince Edward kept household at the time of his father's death and where he was sent to maintain order and restrain criminal activity.\n4. Supporting Quote: 'That noble Prince his son drew toward London, who at the time of his father's death kept household at Ludlow in Wales…That the authority of his presence should restrain evilly disposed persons from the boldness of their former outrages.' (S.17)\n5. Combined Answer: Wales is mentioned in the text as the place where Prince Edward kept household and was sent to maintain order and prevent crime: 'That noble Prince his son drew toward London, who at the time of his father's death kept household at Ludlow in Wales...That the authority of his presence should restrain evilly disposed persons from the boldness of their former outrages.' (S.17)"

edward_analysis = "2. Summary: Section_2: The people's love for King Edward IV increased after his death, as many of those who bore him grudge for deposing King Henry VI were either dead or had grown into his favor. Text: Section_2: Even after his death, this favor and affection toward him because of the cruelty, mischief, and trouble of the tempestuous world that followed afterwards increased more highly. At such time as he died, the displeasure of those that bore him grudge for King Henry's sake, the Sixth, whom he deposed, was well assuaged, and in effect quenched, in that many of them were dead in the more than twenty years of his reign a great part of a long life. And many of them in the meantime had grown into his favor, of which he was never sparing.\n3. Initial Answer: The public regarded Edward IV highly, with their love for him increasing after his death as many of those who bore him grudge for deposing Henry VI either died or grew into his favor.\n4. Supporting Quote: 'Even after his death, this favor and affection toward him because of the cruelty, mischief, and trouble of the tempestuous world that followed afterwards increased more highly...At such time as he died, the displeasure of those that bore him grudge for King Henry's sake, the Sixth, whom he deposed, was well assuaged, and in effect quenched, in that many of them were dead in the more than twenty years of his reign a great part of a long life. And many of them in the meantime had grown into his favor, of which he was never sparing.' (S.2)\n5. Combined Answer: The public regarded Edward IV highly at the time of his death, with their love for him increasing over time. 'Even after his death, this favor and affection toward him because of the cruelty, mischief, and trouble of the tempestuous world that followed afterwards increased more highly.' (S.2)\nExcellent. Let’s try another."

examples = [
    {"question": "Question: Where was Edward IV buried?", "output": windsor_analysis},
    {"question": "Question: Is Wales mentioned in the text?", "output": wales_analysis},
    {"question": "Question: How did the public regard Edward IV?", "output": edward_analysis},
]

# This is how we specify how the example should be formatted.
example_prompt = PromptTemplate(
    input_variables=["question"],
    template="question: {question}",
)

quotation_extraction_prompt = "You are an AI question-answerer and quotation-selector. The focus of your expertise is interpreting “The History of Richard III” by Thomas More. In this exercise you will first be given a user question, a Section of More’s text, and a Method for answering the question and supporting it with an appropriate quotation from the Section. In following this Method you will complete each step in turn until finished.\n\nHere is your Method.\nMethod: Go step by step in answering the question.\n1. Question: You will be provided with a user question.\n2. Section: You will be given a section from Thomas More's 'The History of Richard III.'\n3. Compose Initial Answer: Based on the Question and information provided in the Section, compose a historically accurate Initial Answer to that Question. The Initial Answer should be incisive, brief, and well-written.\n4. Identify Supporting Quote: Based on the Answer, select a Quote from the Section that supports that Answer. Be sure to only select Quotes from the “Text:Section_number” part of the Section. Select the briefest and most relevant Quote possible. You can also use paraphrasing to further shorten the Quote. Cite the Section the Quote came from, in the following manner: (S.1) for quotes from Section_1.\n5. Combined Answer with Supporting Quote: Rewrite the Initial Answer to incorporate the Quote you’ve identified from the “Text:Section_number” part of the Section. This Combined Answer should be historically accurate, and be incisive, brief, and well-written. All Quotes used should be cited using the method above.\nLet’s begin."
```

This prompt design employs the same structure of few-shot learning and chain-of-thought prompting as in the last example.
Prompt 3: Quotation Extraction
You are an AI question-answerer and quotation-selector. The focus of your expertise is interpreting “The History of Richard III” by Thomas More. In this exercise you will first be given a user question, a Section of More’s text, and a Method for answering the question and supporting it with an appropriate quotation from the Section. In following this Method you will complete each step in turn until finished.
Here is your Method.
Method: Go step by step in answering the question.
- Question: You will be provided with a user question.
- Section: You will be given a section from Thomas More’s ‘The History of Richard III.’
- Compose Initial Answer: Based on the Question and information provided in the Section, compose a historically accurate Initial Answer to that Question. The Initial Answer should be incisive, brief, and well-written.
- Identify Supporting Quote: Based on the Answer, select a Quote from the Section that supports that Answer. Be sure to only select Quotes from the “Text:Section_number” part of the Section. Select the briefest and most relevant Quote possible. You can also use paraphrasing to further shorten the Quote. Cite the Section the Quote came from, in the following manner: (S.1) for quotes from Section_1.
- Combined Answer with Supporting Quote: Rewrite the Initial Answer to incorporate the Quote you’ve identified from the “Text:Section_number” part of the Section. This Combined Answer should be historically accurate, and be incisive, brief, and well-written. All Quotes used should be cited using the method above. Let’s begin.
Example Prompt for in-context learning:
Question: Where was Edward IV buried?
Summary: Section_1: King Edward IV was a beloved king who was interred at Windsor with great honor. He was especially beloved by the people at the time of his death.
Text: Section_1: This noble prince died at his palace of Westminster and, with great funeral honor and heaviness of his people from thence conveyed, was interred at Windsor. He was a king of such governance and behavior in time of peace (for in war each part must needs be another’s enemy) that there was never any prince of this land attaining the crown by battle so heartily beloved by the substance of the people, nor he himself so specially in any part of his life as at the time of his death.
Initial Answer: King Edward IV was buried at Windsor with great honor and mourning from his people.
Supporting Quote: ‘This noble prince died at his palace of Westminster and, with great funeral honor and heaviness of his people from thence conveyed, was interred at Windsor.’ (S.1)
Combined Answer: King Edward IV was interred at Windsor with great honor and mourned by his people: ‘This noble prince…was interred at Windsor…and at the time of his death there was never any prince of this land attaining the crown by battle so heartily beloved by the substance of the people.’ (S.1)
Excellent. Let’s try another.
```python
# Code for calling GPT-4 with the Quote Extraction prompt for the relevant text sections.
pd.set_option('display.max_colwidth', None)

example_prompt = SystemMessagePromptTemplate.from_template(quotation_extraction_prompt)
human_message_prompt = HumanMessagePromptTemplate.from_template("Question: {question}\nKey Terms:")
chat_prompt = ChatPromptTemplate.from_messages([example_prompt, human_message_prompt])

chat = ChatOpenAI(temperature=0, model_name="gpt-4")
chain = LLMChain(llm=chat, prompt=chat_prompt)

# Create an empty list to store the final_analysis results
final_analysis_results = []

# Iterate over the output_values list
for output_value in output_values:
    # Run the final_analysis step and store the result in a variable
    final_analysis = chain.run(question + output_value)
    # Add the final_analysis result to the list
    final_analysis_results.append(final_analysis)

# Create a Pandas dataframe from the output_values list
final_analysis_df = pd.DataFrame({'output_values': output_values, 'final_analysis': final_analysis_results})

display(Markdown(f"**Analysis 1:**\n\n" + final_analysis_df['final_analysis'][0] + "\n\n"))
display(Markdown(f"**Analysis 2:**\n\n" + final_analysis_df['final_analysis'][1] + "\n\n"))
display(Markdown(f"**Analysis 3:**\n\n" + final_analysis_df['final_analysis'][2] + "\n\n"))
```

Using semantic search and prompt chaining, this case demonstrates how to answer a user's questions about a longform text with direct quotations and direct citations. There is also a log of the model's "reasoning" to aid human review for LLM hallucinations.
From here, additional links of the prompt chain could be added for customized analytical purposes. Each text section could be combined into a single analysis, providing the reader with a narrative response grounded in multiple sections of the text. GPT-4 could also be tasked with a range of other inquiries: contextualizing these events within the broader events of the War of the Roses, extracting character interactions for network analysis, or extracting geocoding data for digital mapping. Indeed, given GPT-4's remarkable capacities, perhaps the major limiting factors in using this technology are the researcher's imagination and research budget.
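As one illustration of the network-analysis idea, character mentions extracted from each section could be turned into a co-occurrence edge list with a few lines of stdlib Python. The section labels and character lists below are hypothetical placeholders, not actual model output:

```python
from itertools import combinations
from collections import Counter

# Hypothetical per-section character mentions, as an LLM might extract them.
section_characters = {
    "Section_1": ["Edward IV"],
    "Section_17": ["Prince Edward", "Anthony Woodville", "The Queen"],
    "Section_113": ["Doctor Shaa", "Edward IV", "The Queen"],
}

# Count each pair of characters appearing in the same section, yielding a
# weighted edge list suitable for graphing tools such as networkx or Gephi.
edges = Counter()
for characters in section_characters.values():
    for pair in combinations(sorted(set(characters)), 2):
        edges[pair] += 1
```

Each edge weight records how often two figures share a section, a simple proxy for narrative proximity that a subsequent prompt-chain link could refine.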
The possibilities for querying a historical source with customized analytical approaches are compelling. So too is the potential to scale this approach. Scholars have employed similar techniques that enable natural language queries over their Zotero research collections. () () Archival collections and other digitized text corpora could be searched in a similar manner. This capacity to "ask a source" could expand accessibility, accelerate research, and enable new forms of interpretation of the past.
What Do AIs “Know” About History? Assessing GPT-3’s Historical Capacities¶
The above case studies demonstrate the general versatility of LLMs on a range of technical and analytical tasks. Of particular interest to historians are empirical studies documenting generative AI's capacities for historical interpretation.
Machine learning researchers have devised a series of benchmarks for measuring the capacities of LLMs on various forms of academic knowledge. One recently established benchmark measures LLMs' performance on the Advanced Placement (A.P.) curricula for U.S., European, and World history. Hundreds of thousands of secondary students across the globe enroll annually in these curricula, which are designed to replicate the rigors of an introductory university-level history course.
In January 2021, a team of ML researchers led by Dan Hendrycks tested GPT-3 on hundreds of multiple-choice questions from the A.P. History curricula, along with fifty-seven other academic disciplines. Twenty-five percent accuracy represented random chance; eighty percent reflected expert-level accuracy. GPT-3 initially achieved over 50% accuracy on all three A.P. curricula. GPT-3's performance in these subfields numbered among the top third of all the academic disciplines included in the study, although in no field did GPT-3 achieve expert-level accuracy. While demonstrating strengths in some areas, GPT-3 nonetheless possessed worrying blind spots, such as particularly poor performance in the fields of "Moral Questions" and "Professional Law." As the authors note, this "weakness is particularly concerning because it will be important for future models to have a strong understanding of what is legal and what is ethical." ()
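The benchmark's scoring logic is straightforward: each multiple-choice answer is compared against the key, and accuracy is the fraction correct, read against the 25% random baseline and the 80% expert level the study uses. A sketch with made-up answers (not actual benchmark data):

```python
# Placeholder model answers and answer key for a four-option benchmark.
model_answers = ["A", "B", "C", "A", "D", "B", "A", "C"]
answer_key    = ["A", "B", "D", "A", "D", "C", "A", "C"]

# Count matching answers and convert to a percentage.
correct = sum(m == k for m, k in zip(model_answers, answer_key))
accuracy = correct / len(answer_key) * 100

RANDOM_BASELINE = 25.0  # four options per question
EXPERT_LEVEL = 80.0     # expert-level threshold used in the study

above_chance = accuracy > RANDOM_BASELINE
expert = accuracy >= EXPERT_LEVEL
```

GPT-3's initial history scores of roughly 53-56% thus sit well above chance but well short of the expert line.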
Understanding the format of the benchmarks is important in evaluating the performance of LLMs. Below are two example questions drawn from the U.S. History curriculum, both using the same historical source. The code below displays GPT-3's responses:
Here are the accuracy rates for GPT-3 for the initial Hendrycks study: US History, 52.9%; European History, 53.9%; and World History, 56.1%. Full data for questions for history and other disciplines can be found at: () Many thanks to Dan Hendrycks for sharing the discipline-specific accuracy rates for these fields.
```python
us_history_benchmark_q5 = """This question refers to the following information.\n\n\"I was once a tool of oppression\n\nAnd as green as a sucker could be\n\nAnd monopolies banded together\n\nTo beat a poor hayseed like me.\n\n\"The railroads and old party bosses\n\nTogether did sweetly agree;\n\nAnd they thought there would be little trouble\n\nIn working a hayseed like me. . . .\"\n\n—\"The Hayseed\"\n\nThe song, and the movement that it was connected to, highlight which of the following developments in the broader society in the late 1800s?\n\nA: Corruption in government, especially as it related to big business, energized the public to demand increased popular control and reform of local, state, and national governments.\n\nB: A large-scale movement of struggling African American and white farmers, as well as urban factory workers, was able to exert a great deal of leverage over federal legislation.\n\nC: The two-party system of the era broke down and led to the emergence of an additional major party that was able to win control of Congress within ten years of its founding.\n\nD: Continued skirmishes on the frontier in the 1890s with American Indians created a sense of fear and bitterness among western farmers."""

us_history_benchmark_q22 = """This question refers to the following information.\n\n\"I was once a tool of oppression\n\nAnd as green as a sucker could be\n\nAnd monopolies banded together\n\nTo beat a poor hayseed like me.\n\n\"The railroads and old party bosses\n\nTogether did sweetly agree;\n\nAnd they thought there would be little trouble\n\nIn working a hayseed like me. . . .\"\n\n—\"The Hayseed\"\n\nWhich of the following is an accomplishment of the political movement that was organized around sentiments similar to the one in the song lyrics above?\n\nA: Establishment of the minimum wage law.\n\nB: Enactment of laws regulating railroads.\n\nC: Shift in U.S. currency from the gold standard to the silver standard.\n\nD: Creation of a price-support system for small-scale farmers."""

display(Markdown("**U.S. History Benchmarks - Question 5:** \n\n" + us_history_benchmark_q5 + "\n\n\n**U.S. History Benchmarks - Question 22:** \n\n" + us_history_benchmark_q22))
```

```python
import openai

question_5 = openai.Completion.create(
    model='text-davinci-002',
    prompt=us_history_benchmark_q5,
    temperature=0,
    max_tokens=50
)
question_22 = openai.Completion.create(
    model='text-davinci-002',
    prompt=us_history_benchmark_q22,
    temperature=0,
    max_tokens=50
)

display(Markdown("**GPT-3's Answer for Question 5:** " + (question_5.choices[0].text) + "\n\n**Correct Answer**\n\n A: Corruption in government, especially as it related to big business, energized the public to demand increased popular control and reform of local, state, and national governments.\n\n" + "\n\n**GPT-3's Answer for Question 22:** \n\n" + (question_22.choices[0].text) + "\n\n**Correct Answer**\n\n B: Enactment of laws regulating railroads."))
```

Since 2021, the release of new GPT models trained using "reinforcement learning from human feedback" (RLHF) has dramatically improved the performance of LLMs on these historical benchmarks, as well as numerous others. () Below are the results from my replication of the Hendrycks study using later models in the GPT series: the GPT-3 Instruct model, ChatGPT (GPT-3.5), and GPT-4.
```python
# Designed with the help of GPT-4
import matplotlib.pyplot as plt
import seaborn as sns

csv_files = [
    "script/euro_history_benchmark_tests_chatgpt.csv",
    "script/euro_history_benchmark_tests_gpt3.csv",
    "script/euro_history_benchmark_tests_gpt4.csv",
    "script/us_history_benchmark_tests_chatgpt.csv",
    "script/us_history_benchmark_tests_gpt3.csv",
    "script/us_history_benchmark_tests_gpt4.csv",
    "script/world_history_benchmark_tests_chatgpt.csv",
    "script/world_history_benchmark_tests_gpt3.csv",
    "script/world_history_benchmark_tests_gpt4.csv",
]

# Function to calculate accuracy from a CSV file
def calculate_accuracy(file_path):
    df = pd.read_csv(file_path)
    correct_count = df['correct_status'].value_counts().get('correct', 0)
    total_count = len(df)
    return correct_count / total_count * 100

# Calculate accuracies for each file
accuracies = {file: calculate_accuracy(file) for file in csv_files}

# Function to extract the model and history type from the file path
def extract_info(file_path):
    file_name = file_path.split("/")[-1].split(".")[0]
    history_type, model = file_name.split("_benchmark_tests_")
    return history_type, model

# Convert the accuracy dictionary to rows of [history type, model, accuracy]
data = []
for file, accuracy in accuracies.items():
    history_type, model = extract_info(file)
    if model == 'chatgpt':
        model = 'ChatGPT (GPT-3.5)'
    elif model == 'gpt3':
        model = 'GPT-3 (Instruct model)'
    elif model == 'gpt4':
        model = 'GPT-4'
    data.append([history_type, model, accuracy])

# Add the reported accuracy values from the original Hendrycks test
hendrycks_test_data = [
    ["us_history", "GPT-3 (Hendrycks test)", 52.9],
    ["euro_history", "GPT-3 (Hendrycks test)", 53.9],
    ["world_history", "GPT-3 (Hendrycks test)", 56.1],
]
for history_type, model, accuracy in hendrycks_test_data:
    data.append([history_type, model, accuracy])
```

The trajectory of the GPT series on this form of historical knowledge offers a striking demonstration of the rapid gains made in just a few years.
GPT-4 now achieves expert-level accuracy on all three of the subject exams. These findings mirror GPT-4's performance in other knowledge domains such as medical tests (), American bar exams (), and a host of other standardized assessments. ()
Yet, why do GPT-4 and other LLMs perform better in some knowledge domains than others? How can a model answer one question correctly yet generate errors on another? There is a temptation to parse the model's performance in ways relatable to our human perspective. The human test taker might approach the question by assessing what types of historical thinking each question requires, what sort of knowledge is offered by the options, and how the historical source relates to the question. But, of course, GPT-4 isn't human - and unlike the human test taker, it has likely already seen the question in advance. In 2021, nearly 400,000 students took the A.P. U.S. History exam. () A vast web presence has emerged to serve the sizable population of students and instructors participating in this international curriculum. Hundreds of exam questions have migrated online via the collective efforts of the test prep publishing industry, various study apps, and uploaded example tests. Given the scale of the dataset used to create it, many of these questions have likely ended up in GPT-4's training data. If those who critique LLMs as "stochastic parrots" are correct, then GPT-4's success likely comes from sheer memorization, and not through any analytical process. (, 618.) GPT-4's varying performance in A.P.'s different historical fields supports this hypothesis. GPT-4 achieves over 90% accuracy in the most popular A.P. courses: U.S. History (second most popular overall) and World History (fifth). In contrast, the GPT series lags in accuracy on European History, the seventeenth most popular A.P. course. (“Student Score Distributions: AP Exams - May 2019.”) This less popular exam would presumably have a smaller presence both online and in GPT-4's training data. However, this argument is admittedly speculative. GPT-4's training data is not available for public inspection, and the specific mechanisms of how LLMs process information remain a fluid field of inquiry.
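The contamination concern can be made concrete. One rough heuristic (a sketch for illustration, not a method used in this study) is to measure how many word n-grams of a benchmark question also appear in a candidate training text; high overlap suggests the question may have been memorized rather than reasoned through. The strings below are invented examples:

```python
def ngram_set(text, n=5):
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(question, corpus_text, n=5):
    """Fraction of the question's n-grams that also occur in the corpus text."""
    q = ngram_set(question, n)
    if not q:
        return 0.0
    return len(q & ngram_set(corpus_text, n)) / len(q)

# Hypothetical question and study-guide snippet, for illustration only
question = "which of the following best explains the rise of the populist party"
corpus = "study guide which of the following best explains the rise of the populist party in the 1890s"
print(overlap_ratio(question, corpus))  # 1.0
```

Real contamination audits are far more involved (OpenAI's own reports describe n-gram matching at much larger scale), but the principle is the same: verbatim reuse of test material leaves measurable traces.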
Yet even if GPT-4's remarkable performance on standardized tests is the product of memorization, this knowledge has long been a springboard for more advanced forms of historical inquiry. And A.P. study guides are not the only historical texts the GPT series is trained on. Primary source collections, academic monographs, scholarly journals - these too form GPT-4's training data. The influence of these sources can be found when GPT-4 is posed more complex questions in a structured prompt. Let's return to the earlier A.P. questions above featuring the Populist-era campaign song "The Hayseed." In the following prompt, GPT-4 is given the lyrics and publication history of the song. () It is then instructed to identify the larger historical context of the source, the song's intended purpose and audience, and how the source might be interpreted via different historiographical approaches.
xxxxxxxxxx# Source: Arthur L. Kellog, “The Hayseed,” Farmers Alliance (4 October 1890). Nebraska Newspapers (University of Nebraska Libraries), https://nebnewspapers.unl.edu/lccn/2017270209/1890-10-04/ed-1/seq-1/. Original citation found in: John Donald Hicks, The Populist Revolt: A History of the Farmers' Alliance and the People's Party (University of Minnesota Press, 1931), 168, fn. 30.image = Image.open('media/hayseed.png')new_width = 800new_height = int(image.height * (new_width / image.width))# Resize the imageresized_image = image.resize((new_width, new_height), Image.LANCZOS)# Display the resized image#display(Markdown("""Arthur L. Kellog, “The Hayseed,” *Farmers Alliance* (4 October 1890). Nebraska Newspapers (University of Nebraska Libraries), https://nebnewspapers.unl.edu/lccn/2017270209/1890-10-04/ed-1/seq-1/. \n\nOriginal citation found in: John Donald Hicks, *The Populist Revolt: A History of the Farmers' Alliance and the People's Party* (University of Minnesota Press, 1931), 168, fn. 30."""))metadata={ "jdh": { "object": { "type":"image", "source": [ "Arthur L. Kellog, “The Hayseed,” *Farmers Alliance* (4 October 1890). Nebraska Newspapers (University of Nebraska Libraries), https://nebnewspapers.unl.edu/lccn/2017270209/1890-10-04/ed-1/seq-1/. \n\nOriginal citation found in: John Donald Hicks, *The Populist Revolt: A History of the Farmers' Alliance and the People's Party* (University of Minnesota Press, 1931), 168, fn. 30." ] } }}display(resized_image, metadata=metadata)xxxxxxxxxxprimary_source_analysis_prompt = "You are an AI historian specializing in primary source analysis and historiographical interpretation. 
When given a Primary Source, you will provide a detailed and substantive analysis of that source based on the Historical Method and Source Information below.\n\nStep 1 - Contextualization: Apply the Source Information to provide a lengthy, detailed, and substantive analysis of how the Primary Source reflects the larger historical period in which it was created. In composing this lengthy, detailed, and substantive analysis, note specific events, personalities, and ideologies that shaped the period noted in the Source Information. \n\nStep 2 - Purpose: Offer a substantive exploration of the purpose of the Primary Source, interpreting the author’s arguments through the Contextualization offered in Step 1. \n\nStep 3 - Audience: Compose a substantive assessment of the intended audience of the Primary Source. Note how this audience would shape the Primary Source's reception and historical impact in light of the Contextualization offered in Step 1. \n\nStep 4 - Historiographical Interpretation: Provide a substantive and incisive interpretation of how at least three specific schools of historiographical thought would interpret this source, comparing and contrasting each approach. Different historiographical approaches could include: Progressive, Consensus, Marxist, postmodern, social history, religious history, political history, gender history, and cultural history.\n\nInstructions: Based on the Historical Method outlined above, provide a substantive and detailed analysis of the Primary Source in the manner of an academic historian. Let's take this step by step, and be sure to include every step."display(Markdown("**Primary Source Analysis Prompt:**\n\n" + primary_source_analysis_prompt))xxxxxxxxxx# Code for running primary source analysis of the "Hayseed" with GPT-4.import openaihayseed = "The Hayseed. By Arthur L. 
Kellog.\nFarmers Alliance (Nebraska, 4 October 1890)\n\nTune: Save a Poor Sinner Like Me\n\nI was once a tool of oppression\nAnd as green as a sucker could be\nAnd monopolies banded together\nTo beat a poor hayseed like me.\nThe railroads and old party bosses\nTogether did sweetly agree;\nAnd they thought there would be little trouble\nIn working a hayseed like me. . . .\n—'The Hayseed'"query = openai.ChatCompletion.create( model="gpt-4", messages=[ {"role": "assistant", "content": primary_source_analysis_prompt}, {"role": "user", "content": hayseed} ] ) output = query['choices'][0]['message']['content']display(Markdown("""GPT-4's Interpretation of "The Hayseed" \n\n""" + output))While one can debate aspects of GPT-4’s interpretations, it nonetheless accurately captures much of the context and intent of the source. With the right design (and sufficient budget), GPT-4 could be automated to annotate an entire corpus of primary sources, becoming a tool of the digital historian overwhelmed by an abundance of historical data, as envisioned by Roy Rosenzweig twenty years ago. Further experimentation will be needed to more fully assess GPT-4’s capabilities for historical interpretation. But progress moves quickly in the ML world, and there is intense competition to build new models that advance the existing capabilities of LLMs and shed their shortcomings. Yet progress remains uneven. Of significant concern is LLMs' performance on ethics and morality benchmarks, which continues to reveal troubling weaknesses. (, 31, table A6)
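The corpus-annotation workflow mentioned above could be structured as a simple loop: the reusable analysis prompt is paired with each source, and one request is issued per document. The sketch below builds the request payloads without sending them (`annotate_corpus` is a hypothetical helper, and the sources are placeholders; an actual run would pass each message list to the chat completion endpoint, with attendant cost):

```python
def build_messages(analysis_prompt, source_text):
    """Pair the reusable analysis prompt with one primary source.
    The prompt is placed in the system role here, a common convention."""
    return [
        {"role": "system", "content": analysis_prompt},
        {"role": "user", "content": source_text},
    ]

def annotate_corpus(analysis_prompt, sources):
    """Hypothetical batch loop: one structured request per source.
    The API call itself is omitted; each message list would be sent via
    the chat completion endpoint with model='gpt-4'."""
    return [build_messages(analysis_prompt, s) for s in sources]

# Illustrative two-document corpus (placeholders, not real sources)
prompt = "You are an AI historian specializing in primary source analysis..."
corpus = ["Source text A", "Source text B"]
requests = annotate_corpus(prompt, corpus)
print(len(requests))  # 2
```

Separating request construction from the network call also makes the pipeline testable and lets the historian inspect exactly what each source-plus-prompt pairing looks like before spending API budget.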
It is at this juncture that historians should contribute their distinctive expertise to the collective effort to establish ethical guidelines informing future AI research. () We must especially confront the difficult challenge raised by AI researcher Janelle Shane: “Sometimes, to reckon with the effects of biased training data is to realize that the app shouldn’t be built.”
Chatting with Representations of the Past: Why Historians Should Care About AI¶
While AIs might master multiple-choice questions, most historians would consider this an insufficient proxy for true historical fluency. We need more creative forms of assessment and accessible tools that permit experimentation. Historians also need to engage with the ethical ramifications of these experiments and devise socially responsible frameworks for implementing these technologies. But we'll need to think quickly - this technology has already enabled a compelling idea that is nonetheless fraught with unintended consequences.
Among their many talents, GPT-4 and other LLMs are adept at generating responses when guided by a specific point of reference, such as the perspective of a well-known historical figure. This surprising capacity enables a simulation of the worldview of a historical personality. This ability may unlock new forms of interaction with historical sources. It could also reproduce ELIZA effects with significant ramifications for the public's engagement with the past.
Such was the context of my first experimentation with an LLM: a simulated conversation with "Martin Luther". I selected Luther because of his historical significance and because his conversational style is arguably captured in works like the Table Talk, which reflected his views on a wide range of subjects. () Using OpenAI's Playground, I directed GPT-3 to adopt this perspective with the following prompt: “I am an AI representation of Martin Luther, a key figure in the Protestant Reformation. You can ask me questions about faith and theology, and I will answer at great length and in the style of Luther's Table Talk.”
And so “he” did. Our chat ranged over key moments in Luther’s life, religious teachings, and even contemporary events. (citation removed for peer review) To be sure, GPT-3 generated for “Luther” some serious hallucinations, such as Emperor Charles V’s conversion to Lutheranism and the Catholic Church’s admission of error at the Diet of Worms. Yet GPT-3 offered accurate and evocative responses in other areas. GPT-3 correctly identified Luther’s views on scriptural authority, the basis of human salvation, and the doctrine of predestination. In engaging with Luther’s views on Copernicus, GPT-3 correctly interpreted Luther’s opposition to heliocentrism and cited appropriate biblical passages supporting that view. Luther even complained of his depiction in contemporary historiography, citing the preeminent scholar in the field. I attempted to further enhance the verisimilitude of Luther’s responses by creating a fine-tuned model of GPT-3 trained on the actual text of Luther’s Table Talk. () This worked to a degree, and GPT-3 soon generated responses that more accurately matched Luther’s famous pugnacity. But I soon questioned the wisdom of creating an application that accurately mimics Luther. His language inspired profound religious and cultural transformation whose power continues to reverberate centuries later. Luther’s language also inspired violence, in his time and in recent memory. ()
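Fine-tuning of the kind described above required training data in the JSONL prompt/completion format used by the GPT-3-era fine-tuning endpoint: one JSON object per line, each pairing a prompt with a desired completion. A minimal sketch of preparing such a file (the pairs below are invented stand-ins, not the actual Table Talk excerpts used in the experiment):

```python
import json

# Hypothetical prompt/completion pairs in Table Talk style; the real
# fine-tune drew on passages from Luther's Table Talk, not these examples.
pairs = [
    {"prompt": "What do you say of faith?",
     "completion": " Faith is a living, daring confidence in God's grace."},
    {"prompt": "What of idle speculation?",
     "completion": " Let us speak of Christ, not of vain subtleties."},
]

# Serialize one JSON object per line - the JSONL layout the
# GPT-3-era fine-tuning endpoint expected.
jsonl = "\n".join(json.dumps(p) for p in pairs)
print(jsonl.count("\n") + 1)  # 2
```

A file in this shape would then be uploaded and referenced when creating the fine-tune job; the quality of the resulting "voice" depends almost entirely on how the historical text is segmented into such pairs.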
Following the release and surging popularity of ChatGPT in November 2022, dialogues with simulated historic figures proliferated over social media. App developers quickly built interfaces for using ChatGPT in such simulated interactions, drawn by the pedagogical potential of new forms of historical engagement. However, these apps did little to address the problems of LLM "hallucinations," nor the potent ethical ramifications such approaches raise. Users soon reported their conversations with both humanity's greatest luminaries and its greatest villains. () The ability of these applications to “bring history to life” soon gave way to an appreciation that perhaps some parts of the past are better off dead.
LLMs are quickly entering widespread public use. These technologies have the potential to inspire new forms of human discovery and creativity. Yet if we do not take care, AIs will also advance the inequalities, injustices, and misinformation that form the record of human history on which they are trained.
Historians have a stake in this future. The informed and ethical integration of AI in historical research and pedagogy has the potential to democratize access to the past, fostering greater inclusivity in a time of educational austerity. This technology can enhance the learning experience by connecting historical information in novel ways to students and researchers alike, allowing for innovative explorations of historical data and primary sources. In turn, this can foster new scholarly conversations, enrich classroom discussions, and inspire a deeper appreciation for the complexities of the past. But we first have to understand its strengths and limitations, as with any historical source.
GPT-4 is anchored within a specific time, with definitive (if vast) contours that historians can interrogate - except this source can respond to your questions. And yes, GPT-4 invents facts, confuses dates, and distorts the past. But don’t our existing sources already require careful examination? Digital historians have demonstrated historiographical innovation in utilizing emerging technologies to create new forms of scholarship. There is similar potential for historical explorations of generative AI. The effort is worthwhile, as few historical sources possess GPT-4’s scope. However flawed, generative AI represents a powerful tool for addressing Roy Rosenzweig’s call to grapple with the “unheard-of historical abundance” of the digital age.
Bibliography¶